Why We Do Not Preemptively Replace Hard Drives
I was recently shocked to learn that some IT pros were pushing very strongly that businesses should be proactively removing healthy, burned in hard drives from production to be replaced with new, untested hard drives without any indication of impending failure or existing errors, let alone an active failure. The thought process here, presumably, is that drives will wear out and fail and that by replacing them early we can avoid these failures. But there are many problems with this approach.
- We do not know when hard drives are going to fail, not even approximately. Statistics from other pools of drives, even of the same model that we are using, tell us very little about our own drives, because our own environmental conditions play a significant role in hard drive lifespans. So we are forced either to live with a lot of risk of drives failing on their own anyway or to replace them excessively early, when there is almost no expectation of failure.
- We induce a failure in order to do the replacement. This is a bit of a problem. Replacing an old drive with a new one is not a painless or safe process. To the RAID array, we have had a failure and the array goes into a degraded state while the drive is resilvered and added back into the array. Because we are removing healthy drives to do this, we are actually creating more failures, potentially a lot more failures and more time in a degraded state than necessary. Presumably we can do this at a scheduled time with planning, but it still represents additional risk that need not exist. This risk skyrockets if we are dealing with parity RAID where the reduction in redundancy can be dramatic and the time to rebuild can be very high.
- Drives have a bathtub curve of failure. One of the problems is that we don't know how wide the bathtub is. The new drives that we put in are at one of the high points of the curve making them quite risky. In drive terms this is referred to as "infant mortality." So we have a very real risk that we are removing a perfectly healthy, well tested drive to introduce a much less well known drive with a statistically higher failure rate. The earlier we replace drives, the worse this problem becomes.
- Induced failures and resilver attempts, especially in parity arrays, put an enormous strain on the remaining drives, increasing the chances of additional failures in the existing drive pool.
- MTBF numbers tell us nothing about the chances of drive failures. MTBF is not mean time between failures as people often assume; it is measured as the Mean Time Before Failure, taken across a large pool of drives until one of them fails. It is almost completely useless for determining anything about the width or depth of a bathtub curve. This leaves us completely in the dark as to drive failure rates for our drives of choice and for our own environment. So we may be replacing drives before they go through their healthiest period.
- Drives do not fail together. Well, sort of. The theory is that an MTBF tells us when a drive will fail, but this is based on the assumption that drives in a pool fail at nearly the exact same time. To some degree they fail close to each other when from the same batch, but this effect is minor and not well documented. There is little reason to fear this effect as it is mostly a myth rather than a statistically measured phenomenon and while there are good reasons to suspect it exists it does not play out as a significantly measured risk in the real world. And, more importantly, there are far better ways to mitigate this risk such as getting drives from different batches.
- By proactively pulling healthy drives from service, we give up our ability to measure the reliability curves of those drives in our own environment. That leaves us even more in the dark about how they would have performed and whether replacing them was at all effective, as we have no way to know at what point of the bathtub curve we were replacing them. So we end up knowing even less about our current drives and, more importantly, about our own environment's effect on drive health.
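To make the bathtub curve point a little more concrete, here is a rough Python sketch. Every number in it is an invented assumption chosen only to give the curve its shape; none of it is measured data for any real drive, and as noted above we do not actually know how wide the bathtub is. The model adds three cumulative hazard terms (infant mortality that fades, a constant random failure rate, and wear-out that grows) and asks how likely a drive of a given age is to fail within the next year.

```python
import math

# Illustrative, made-up parameters. NOT measured values for any real drive.
INFANT_SHAPE, INFANT_SCALE = 0.5, 400.0  # Weibull shape < 1: hazard falls with age
RANDOM_RATE = 0.01                       # constant failures per drive-year
WEAR_SHAPE, WEAR_SCALE = 5.0, 9.0        # Weibull shape > 1: hazard rises with age


def cumulative_hazard(age_years):
    """Total accumulated hazard from installation up to age_years."""
    infant = (age_years / INFANT_SCALE) ** INFANT_SHAPE
    random_part = RANDOM_RATE * age_years
    wear = (age_years / WEAR_SCALE) ** WEAR_SHAPE
    return infant + random_part + wear


def failure_probability(age_years, window_years=1.0):
    """Chance that a drive still alive at age_years fails within the window."""
    delta = cumulative_hazard(age_years + window_years) - cumulative_hazard(age_years)
    return 1.0 - math.exp(-delta)


if __name__ == "__main__":
    for age in (0, 1, 2, 3, 5, 7):
        print(f"age {age} years: {failure_probability(age):.3f}")
```

Under these assumed parameters, a brand new drive is riskier over the next year than a burned-in drive a few years old, and the risk only climbs again late in life. Change the parameters and the crossover points move, which is exactly the problem: without knowing the real curve for our own environment, an early swap can easily trade a low-risk drive for a higher-risk one.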
So as you can see, the idea of replacing drives that have not failed and are exhibiting no signs of pending failure (SMART errors, for example) not only provides no value except for a false sense of security but also introduces very real risk, blocks us from useful information and wastes potentially a very large amount of money.
That's super weird they'd replace them proactively. I could imagine scenarios where it might make some sense to do it, but not much sense. Maybe driven by super high labour costs for maintenance / access / tear down? Array is already offline and you know of a bad batch of drives?
I agree, I was pretty taken aback that this wasn't just mentioned but pretty strongly defended even to go so far as to suggest thinking that you could skip doing this meant you didn't understand storage.
This is the thread that prompted this as a point of concern, so I felt that it should be documented so anyone confused by the line of reasoning presented would have something to look to as to why that would not work.
For example: Umm, Really Scott? This really speaks to your experience level in server operations. As I read this, you never pre-emptively replace hard drives? Are you really telling us that?
He seems to honestly think that I said something weird. I've never even talked to a company that would consider doing something like this before, and I've worked for a few governments, research hospitals, Wall Street and Canary Wharf firms, the world's biggest bank, the world's biggest hedge fund, military contractors, world's biggest IT firm, etc. None would ever do something like this both because of the risk and because of the waste of money.
What I fear is that this is a common practice in some segment and is being done by IT in a bubble who don't realize what a weird and bad practice this is.
This is the quote that brought it to our attention: Replacing array drives in situ is part of normal preventive maintenance...
"Scott, your argument that MTBF is meaningless has no substance. Perhaps you do not have as strong a background in electronics theory and design. Also, I am promoting the concept of STAGGERED drive replacement, which mitigates multiple simultaneous drive failures: instead of having an entire array of drives with exactly the same wear and tear at any instant in time, each drive has a different estimated failure date. The more you stagger the individual drive replacements, the better your possibility of encountering only one drive failure at any given time. "
I'm going to don my protective equipment, pop some corn and find a comfy chair.
Sorely tempted to reply to him myself.
He posted another comment, appears to be very unprofessional & slagging / condescending. Gross.
The staggering bit has merit; the issue comes from how the staggering happens. When we replace drives after organic failure, they naturally stagger for the next round of failures. So you don't gain staggering from this approach that would not have already existed.
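A quick way to see the natural staggering is a toy simulation. This is only a sketch: the Weibull lifespan model (scale of 6 years, shape of 3), the array size, and the function name are all assumptions made up for illustration, not figures from any real drive population. Every slot starts with a drive installed on day zero, and a drive is only replaced when it fails on its own.

```python
import random


def drive_ages_after(years, drive_count=8, seed=1):
    """Ages of the drive in each slot after `years` of replace-on-failure only."""
    rng = random.Random(seed)
    ages = []
    for _ in range(drive_count):
        installed_at = 0.0  # every slot starts with a drive installed at time zero
        while True:
            # Hypothetical lifespan model: Weibull, scale 6 years, shape 3.
            fails_at = installed_at + rng.weibullvariate(6.0, 3.0)
            if fails_at > years:
                break
            installed_at = fails_at  # replacement goes in when the old drive dies
        ages.append(years - installed_at)
    return ages


if __name__ == "__main__":
    for slot, age in enumerate(drive_ages_after(10.0)):
        print(f"slot {slot}: current drive is {age:.1f} years old")
```

Even though every drive started out the same age, the drives in service after a decade are spread across a range of ages, with no healthy drive ever pulled to get there.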
The idea that causing a system to fail would be routine maintenance is just.... I have no idea. Imagine crashing a car, just a little, to cause the airbags to deploy so that you get fresh ones installed once every 50,000 miles. That just doesn't make sense.