Failed Drives on Our Scale HC3 Cluster at Colocation America

  • Many of you know that we have a large Scale HC3 cluster located in Los Angeles with Colocation America. We recently had two drives fail. It's a large cluster with a lot of drives, and the two failures were in different hosts. They didn't fail at the same time, but they did fail within a relatively short span of each other, so I thought that I would share my experience with it.

    First of all, the cluster rebalanced automatically. I knew that it would, but I had not previously seen it happen in a loss-of-disk scenario. The missing disks were marked and removed from the RAIN system, and the storage redundancy moved to other disks in the cluster. So from the workload perspective, nothing had happened. The system even rebalanced the workloads to situate them properly across the modified nodes.

    The cluster alerted us that disks had failed, and I just opened a ticket with Scale. They logged in and identified the disks that needed to be replaced, along with their models, and new drives were shipped out directly to Colocation America's processing facility. So the drives never had to pass through our hands at all.

    Colocation America got the drives the next day, then went down to the cluster and got on the phone with us to verify the procedure as they swapped out the drives. The lights on the cluster made it easy to point them to the drives that had failed. Upon replacement, the cluster immediately changed status to show that the new drives had come online, and the SCRIBE RAIN system started rebalancing the on-disk storage to make use of the new drives.

    We just confirmed with Colocation America that the swap had gone correctly, and then they shipped the failed drives back to Scale Computing once everything was done.

    It all happened as expected, but it's nice to see these things in action. The whole process was incredibly smooth and hands off. The combination of the high availability cluster, the RAIN-based SCRIBE storage system, and an enterprise colocation facility made the process not just painless and quick, but actually effortless. It could easily have been coordinated by someone outside of IT.

    @scale @colocationamerica

  • Magic Unicorn Farts