Why Non-Uniform URE Distribution May Make Parity RAID Riskier Than Thought
In a recent discussion about testing resilver (rebuild) rates in parity RAID arrays, my friend Chris Natapolis noted that success rates in parity resilver tests were extremely high. He asked why, in the face of his and other people's ongoing tests of this nature, it is still widely believed that URE rates are as high as manufacturers claim, and why concern over resilver success in parity arrays persists.
In the thread we discussed the potential for non-uniform distribution of failure rates (both drive failure and URE) to skew such tests so that they do not reflect real world risks. The problem is, of course, that large scale tests are not available. The best data that we have to work from are a few large-ish studies (famously the Google and Backblaze reports), math models based on manufacturers' stated URE rates, and anecdotal evidence.
Some anecdotal tests, which are supported (or at least not contradicted) by all other evidence, suggest that URE distribution is indeed non-uniform and that factors such as age and wear and tear likely cause URE rates to increase dramatically. This means, we surmise, that tests of array resilver operations could be showing that under non-failure conditions arrays are far more likely than average to resilver successfully, and that under failure conditions arrays are far more likely than average to fail - exactly the opposite of what we would hope.
This is because environmental conditions tend to affect all drives in an array the same way, on the assumption that the array was built all at one time and that the drives are physically kept together throughout the life of the array. There can be exceptions to this but they would be very uncommon. Because of this, an induced array resilver implies that there was no reason to suspect the drives of increased error rates, which means that the URE rate might be very, very good at the time of the resilver. So where the math model, which gives only an average (mean) failure rate, predicts that resilvers will fail, we instead get successful resilver after successful resilver. This also means that array expansion operations (replacing one drive at a time with a larger drive) would be far more likely to succeed than expected based solely on the math averages.
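For readers who have not seen the math model mentioned above, here is a minimal sketch of how it is usually worked out. It assumes every bit read fails independently with the same probability, which is exactly the uniformity assumption this article questions. The function name and the default rate of one URE per 10^14 bits (a figure commonly quoted on consumer drive spec sheets) are illustrative assumptions, not something taken from the thread.

```python
import math

def resilver_success_probability(bytes_read: float, ure_per_bit: float = 1e-14) -> float:
    """Chance of reading bytes_read bytes without hitting a single URE.

    Assumes independent, identically likely errors on every bit (a uniform
    distribution). ure_per_bit is an assumed spec-sheet style rate.
    """
    bits_read = bytes_read * 8
    # (1 - p) ** n, computed through log1p so a tiny p is not lost to rounding.
    return math.exp(bits_read * math.log1p(-ure_per_bit))

if __name__ == "__main__":
    # Total data that must be read from the surviving drives during a resilver.
    for terabytes in (2, 6, 12, 24):
        p = resilver_success_probability(terabytes * 1e12)
        print(f"{terabytes:>3} TB read: {p:.1%} chance of no URE")
```

Under these assumptions a resilver that has to read about 12 TB comes out at roughly a 38% chance of completing without a URE, which is the kind of projection that induced tests keep failing to reproduce.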
This supposition is supported by anecdotal evidence that induced resilvers (which includes parity expansions) are vastly more likely to succeed than expected based on the average rates.
This suggests, then, that if the math model is correct and the average truly is what it is projected to be, then at times of non-induced failure, when a drive has mechanically failed, the other drives in the array are also more likely than average to experience drive failure or a URE during the resilver operation. This is a very scary potential risk. It is widely believed in the industry that drive failures are increasingly likely in a parity array after a first drive has failed - each failing drive leads to a higher likelihood of additional failures in general, and a resilver operation itself is believed to increase failure rates while it runs. But this goes beyond that by adding more failure factors clustered around the same event.
What this means is that in a scenario where, for example, the math model projects a roughly 50% failure rate (or 50% success rate, however you want to look at it), a test run to verify that figure has a good chance of showing much higher success rates, more like 90% (a guess, we have no solid math for this). Tests and expansions would then provide a false sense of security, because they run under conditions that do not reflect a drive failure event. It also follows that when a drive has truly failed and a resilver happens not as a test but to truly protect data, success rates might drop, potentially dramatically, perhaps to more like 10%! (Again, no math, just supposition based on observation.)
This allows the math model to be correct (as it produces only a mean average) while also explaining why anecdotal evidence shows induced rebuilds and expansions to be almost always successful and why anecdotal evidence from non-induced rebuilds shows failure rates far in excess of the projected average.
This is a very scary potential that will take a lot of testing and, sadly, primarily mathematical projections to support. But it is a "new" theory that explains both the math and the real world observations in a way that is logical, sensible and completely scary.
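The arithmetic behind that claim is just a weighted average, and can be sketched as follows. This treats the model's projection as the mean success rate across all resilvers, which is a simplification. The function name is hypothetical, and the even split between induced and failure-driven resilvers is an assumption chosen purely for illustration; nobody has measured it.

```python
def implied_failure_event_success(mean_success: float,
                                  induced_success: float,
                                  induced_fraction: float) -> float:
    """Success rate that failure-driven resilvers must have for the overall mean to hold.

    Solves  mean = f * induced + (1 - f) * failure_driven  for failure_driven.
    All three inputs are guesses in this article, not measured values.
    """
    if not 0 <= induced_fraction < 1:
        raise ValueError("induced_fraction must be in [0, 1)")
    failure_driven = (mean_success - induced_fraction * induced_success) / (1 - induced_fraction)
    if not 0 <= failure_driven <= 1:
        raise ValueError("inputs are inconsistent with a single overall mean")
    return failure_driven

if __name__ == "__main__":
    # Model projects 50% overall; induced tests appear to succeed about 90% of the time.
    # Assumption: half of all resilvers are induced, half follow a real drive failure.
    rate = implied_failure_event_success(0.5, 0.9, 0.5)
    print(f"Implied success after a real failure: {rate:.0%}")
```

With those guessed inputs the failure-driven success rate works out to about 10%, so the 90% and 10% figures are at least consistent with a 50% mean. A different split between induced and real resilvers would give a different answer.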
Quotes about non-uniformity from the original thread:
There are three key sources of non-uniformity to be considered. One is the UREs themselves. If a URE happens every one trillion reads, that's only a mean. It doesn't mean that every trillionth read is a URE. It could be that they come in batches (one hundred happen in a row, followed by one hundred trillion successful reads). We have no idea. This distribution is not made available to us (a cool study to do). Knowing what this distribution looks like and how it compares from drive to drive would be really informative.
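To see why batching matters, here is a rough sketch that keeps the mean URE rate fixed and only changes how the errors are grouped. It assumes batches arrive as a Poisson process and every batch has the same size, which is a modeling choice made for illustration; the real distribution is, as noted, unknown. The function name is hypothetical.

```python
import math

def p_clean_read(expected_ures: float, batch_size: int = 1) -> float:
    """Chance of zero UREs during a read with a fixed mean error count.

    expected_ures is the mean number of UREs the read would see on average.
    Errors are assumed to arrive in Poisson-distributed batches of batch_size,
    so the mean number of batches is expected_ures / batch_size.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    expected_batches = expected_ures / batch_size
    # Poisson probability of seeing zero batches, and therefore zero errors.
    return math.exp(-expected_batches)

if __name__ == "__main__":
    for size in (1, 10, 100):
        print(f"batches of {size:>3}: {p_clean_read(1.0, size):.1%} chance of a clean read")
```

With the same average of one URE per read, isolated errors leave about a 37% chance of a clean read, while errors that come a hundred at a time leave about 99%. Most reads then look perfect and a few are hit very hard, even though the mean rate is unchanged.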
The second is drive to drive. The URE rates are based on thousands of drives averaged. Is it that one drive gets URE 10^10 and the next gets URE 10^20? We don't know. So it might be that we get good and bad drives. Or maybe they are all nearly identical. So it might be that some arrays will basically always fail and others will basically always survive.
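The good-drive, bad-drive idea can be sketched the same way. The example below assumes an extreme split in which a fraction of arrays carries all of the errors and the rest never see one, with the population mean held constant. That split and the function name are hypothetical, chosen only to show the direction of the effect.

```python
import math

def population_success(mean_expected_ures: float, bad_fraction: float) -> float:
    """Average chance of a URE-free resilver across a mixed population of arrays.

    Assumption: a fraction bad_fraction of arrays carries every error, the rest
    carry none, and the population-wide mean error count stays fixed.
    """
    if not 0 < bad_fraction <= 1:
        raise ValueError("bad_fraction must be in (0, 1]")
    # Concentrate the whole error budget on the bad arrays to keep the mean constant.
    bad_expected = mean_expected_ures / bad_fraction
    good_share = 1 - bad_fraction
    return good_share * 1.0 + bad_fraction * math.exp(-bad_expected)

if __name__ == "__main__":
    # bad_fraction = 1.0 is the uniform case where every array is identical.
    for fraction in (1.0, 0.5, 0.1):
        print(f"bad fraction {fraction:.0%}: {population_success(1.0, fraction):.1%} succeed")
```

When every array is identical, about 37% of resilvers succeed. When a tenth of the arrays hold all the errors, about 90% succeed overall, but the unlucky tenth almost never does. That matches the "some arrays always fail, others always survive" picture.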
The third is environmental. Do factors outside of the drives affect URE rates? Likely, yes. We would expect this to be true. Magnetic interference, temperature, age, wear and tear, drive speed, batch, vibration and other factors likely all matter. But again, no one releases this data. So we are pretty much in the dark. What if drives get worse URE with age but start out really good? What if they have lots of UREs on the first run through but are good for a long time right after that? What if at 68 degrees they never get them but at 71 they get them all the time? All interesting questions and costly to answer.
Even without all of the "fancy maths", that just logically makes sense. "When we run tests on our study vehicle with 30,000 miles to see if an engine swap is feasible, we get high-success outcomes; therefore, we expect roughly the same results once the vehicle breaks down at 150,000 miles." The general entropy factor is not equal between the two scenarios, thus the results are skewed.
When we say it, it makes sense. But what has been catching nearly everyone in nearly every discussion I've seen over the past few years is that we are presented with some solid math that only talks about a mean average failure rate over the life of the drive and a mean average recovery rate based on that rate. This leaves all discussions based on that math talking about nothing but mean averages. It's tempting to stick with that because those are the supposedly "known" projections.
Going into non-uniform distribution makes tons of sense, but there is no math behind it, and the simplistic projections where we just have to say "risk is X based on the average" become "there are tons and tons of factors we can't project so we guess it is more or less than X because of this reason or that one."