Rebuild Time on a 96 Drive RAID 7 Array (RAIDZ3)



  • Some of us were discussing what kind of RAID you would use in a 96 drive RAID array today. Of course RAID 10 is the obvious choice, RAID 60 is semi-plausible, and RAID 7 (aka RAID 7.3) was originally proposed and designed for a 48 drive array (Sun's Thumper). RAID 10 would be by far the safest, fastest, most useful choice, but we assume the need for so many disks arises from needing a crazy amount of capacity, which means that RAID 10's loss of 50% of raw capacity would be a costly problem that we'd like to avoid, and RAID 60 is only so much better. RAID 7 is, far and away, the most capacity-efficient option anywhere near this scale. (RAID 7's only real world implementation is ZFS' RAIDZ3.)

    Since we can only assume a need for very large capacity at a spindle count of this nature, we must also assume large drives of 8TB or more. If smaller drives were an option, then we'd choose a smaller spindle count to solve some of our cost problems (just factoring in real world market pressures). So we assume two things: capacity is the primary concern, and therefore the drives must be large for the current market. A typical storage drive today is 4-6TB. 8TB and up would be the large drives, and even 8TB, while big, is not an exceptionally large drive. If we were going for 96 spindles, 10TB+ would seem very, very likely.

    So this means that our failure domain is 744TB in a RAID 7 scenario at a minimum (93 data drives x 8TB), 930TB with 10TB drives, and possibly as high as 1.1PB with 12TB drives. Not small, at all. Plus we have 96 spindles to contend with. When we have a dozen disks in an array, we worry very, very little about a second drive failing. But the larger an array gets, the higher the likelihood that another drive will fail, both because there are more drives to fail and because the bigger the array gets, the longer it takes to rebuild a failed drive (when using parity rather than mirroring), so there is more time for a spindle to fail on us. So moving from 9 drives in an array to 96 drives in an array, we are increasing the spindles that could potentially fail roughly tenfold. That's huge.
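    The failure domain arithmetic above can be sketched quickly. This is just the back-of-envelope calculation from the post, not a real capacity model (it ignores formatting overhead, ZFS metadata, and TB vs TiB):

    ```python
    # Failure-domain sizes for a 96-drive RAIDZ3 (RAID 7) array.
    # RAIDZ3 dedicates three drives' worth of capacity to parity,
    # so usable capacity is (drives - parity) * drive_size.
    DRIVES = 96
    PARITY = 3  # triple parity in RAIDZ3

    for size_tb in (8, 10, 12):
        usable_tb = (DRIVES - PARITY) * size_tb
        print(f"{size_tb}TB drives -> {usable_tb}TB failure domain")
    # 8TB  drives -> 744TB failure domain
    # 10TB drives -> 930TB failure domain
    # 12TB drives -> 1116TB failure domain (~1.1PB)
    ```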

    The bigger problem, though, is the resilver time for a single drive. Assuming that no secondary drive failure ever happens, how long would a 96-drive, 8TB RAID 7 array take to resilver once the failed drive has been replaced? In real world examples we see 24-drive, 6TB RAID 6 arrays sometimes taking over a month to resilver. The more spindles, the longer the process; the more parity (RAID 7 has three calculations for every two in RAID 6), the longer the process; the bigger the drives, the longer the process.

    We can assume that any array like this will be powered by some beefy processors and have loads of spare RAM, at least we hope. This helps, a little. But that is still a lot of data to be read, computed, and written again. If 24 x 6 x 2 comes to one month, what does 96 x 8 x 3 come to? If all of those factors were linear and equal in weight (they are not), the ratio would be 288:2304, which is exactly 1:8. So if we see one to one and a half months with the former array, we could potentially see eight to twelve months on the RAID 7 example here. That's a full year!
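    The scaling estimate works out like this. Treating rebuild effort as spindles x drive size x parity calculations is an assumed-linear simplification, as the post itself notes, not a real resilver model:

    ```python
    # Rough rebuild-effort scaling: spindles * drive_size_TB * parity_ops.
    # All three factors are assumed linear and equally weighted, which
    # the post acknowledges is not really true.
    def rebuild_factor(spindles, size_tb, parity_ops):
        return spindles * size_tb * parity_ops

    raid6 = rebuild_factor(24, 6, 2)    # 288
    raidz3 = rebuild_factor(96, 8, 3)   # 2304
    ratio = raidz3 / raid6              # 8.0

    # If the RAID 6 array takes 1 to 1.5 months to resilver, the naive
    # estimate for the RAIDZ3 array is 8 to 12 months.
    low_months, high_months = 1 * ratio, 1.5 * ratio
    print(ratio, low_months, high_months)  # 8.0 8.0 12.0
    ```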

    Waiting a year for your array to rebuild is epic. I doubt that it would take this long, but until someone produces real world rebuild examples at this scale, we have to assume, I believe, that a year is an actual possibility, albeit an outside one. The point of "so long that it is an effective loss" arrives at a small fraction of this time, so it is unlikely that anyone would ever wait around to discover what a rebuild would have taken.



  • And all of that is just looking at resilvering. With four times as many drives as a 24 disk array and a rebuild time eight times longer, that's 32 times as much "disk time" in which we have to risk a second disk failing on us. In a 24 disk RAID 6 array, the rebuild time is already so long that the parity abuse of the array induces secondary drive failure often enough that we worry about it quite a bit. Magnify that thirty-two times and we have very significant risk from something that in a typical array is just background noise. Our primary failure concerns start to change simply based on the dramatic shifting of factors.
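    The 32x "disk time" figure is just the product of the two multipliers above, both of which come from the earlier estimates rather than measured data:

    ```python
    # "Disk time" at risk: how many spindles could fail, times how long
    # the rebuild leaves them exposed. Both ratios are assumptions from
    # the discussion above, not measurements.
    drive_ratio = 96 / 24   # 4x as many drives as the 24-disk array
    time_ratio = 8          # ~8x longer rebuild, from the linear estimate
    risk_multiplier = drive_ratio * time_ratio
    print(risk_multiplier)  # 32.0
    ```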


  • Vendor

    Great job! Thanks for bringing it here. So many misconceptions and so much black magic assumed about RAID Z3...



  • @KOOLER said in Rebuild Time on a 96 Drive RAID 7 Array (RAIDZ3):

    Great job! Thanks for bringing it here. So many misconceptions and so much black magic assumed about RAID Z3...

    It's a great technology.... with very limited use cases.



  • Seems like the perfect case to use RAIN, even if it's within a single system enclosure. @StarWind_Software LSFS, I'm looking at you. @KOOLER I am right in thinking this is the sort of thing LSFS could handle, right?



  • @travisdh1 said in Rebuild Time on a 96 Drive RAID 7 Array (RAIDZ3):

    Seems like the perfect case to use RAIN, even if it's within a single system enclosure. @StarWind_Software LSFS, I'm looking at you. @KOOLER I am right in thinking this is the sort of thing LSFS could handle, right?

    RAIN in a single enclosure rarely does anything that RAID 10 does not. It's effectively all the same at that point (more or less.) If RAID 10 doesn't work, RAIN isn't going to work either (normally.) The issue here is "single enclosure."



  • @scottalanmiller said in Rebuild Time on a 96 Drive RAID 7 Array (RAIDZ3):

    @travisdh1 said in Rebuild Time on a 96 Drive RAID 7 Array (RAIDZ3):

    Seems like the perfect case to use RAIN, even if it's within a single system enclosure. @StarWind_Software LSFS, I'm looking at you. @KOOLER I am right in thinking this is the sort of thing LSFS could handle, right?

    RAIN in a single enclosure rarely does anything that RAID 10 does not. It's effectively all the same at that point (more or less.) If RAID 10 doesn't work, RAIN isn't going to work either (normally.) The issue here is "single enclosure."

    True. Either way..... makes me go, yuck.



  • This is exactly what @SeanExablox tackles. They do pure storage on RAIN that spans multiple enclosures.



  • @scottalanmiller said in Rebuild Time on a 96 Drive RAID 7 Array (RAIDZ3):

    @travisdh1 said in Rebuild Time on a 96 Drive RAID 7 Array (RAIDZ3):

    Seems like the perfect case to use RAIN, even if it's within a single system enclosure. @StarWind_Software LSFS, I'm looking at you. @KOOLER I am right in thinking this is the sort of thing LSFS could handle, right?

    RAIN in a single enclosure rarely does anything that RAID 10 does not. It's effectively all the same at that point (more or less.) If RAID 10 doesn't work, RAIN isn't going to work either (normally.) The issue here is "single enclosure."

    Wouldn't a properly configured single RAIN node make it easier to grow when it's time to add more storage?

    I've seen this with Exablox and it was a nice feature!



  • @dafyre said in Rebuild Time on a 96 Drive RAID 7 Array (RAIDZ3):

    @scottalanmiller said in Rebuild Time on a 96 Drive RAID 7 Array (RAIDZ3):

    @travisdh1 said in Rebuild Time on a 96 Drive RAID 7 Array (RAIDZ3):

    Seems like the perfect case to use RAIN, even if it's within a single system enclosure. @StarWind_Software LSFS, I'm looking at you. @KOOLER I am right in thinking this is the sort of thing LSFS could handle, right?

    RAIN in a single enclosure rarely does anything that RAID 10 does not. It's effectively all the same at that point (more or less.) If RAID 10 doesn't work, RAIN isn't going to work either (normally.) The issue here is "single enclosure."

    Wouldn't a properly configured single RAIN node make it easier to grow when it's time to add more storage?

    I've seen this with Exablox and it was a nice feature!

    Yes, if you are preparing for scale out. But if you are just doing it within the context of a single node, it doesn't change anything.