What To Do when RAID Has a Hard Drive Failure

scottalanmiller

This specific question comes up often enough that a guide is necessary.

First we need to understand: the main purpose of a RAID array is to ensure continuous availability of services. While there are other reasons to use RAID, such as getting greater speed than a single drive can deliver or aggregating capacity, reliability and durability of services are far and away the primary goal of RAID, especially in a business setting and even more so in a smaller business.

Note: This guide assumes that you are not on RAID 0. RAID 0 is not for the purpose mentioned above and behaves completely differently, which is why many people do not consider it to be RAID at all. There is nothing to be done if you are on RAID 0: if a drive fails in any way, all data on the array is lost. Replace the failed drive, make a new array, start over. That is all that there is to be done.

So you have a RAID array and it has had a drive failure. If the RAID array is functioning properly there should be no impact to your running server and you should have only gotten an alert. If the server has been impacted in any way you have had more than a drive failure and your array has failed in some way, perhaps only a controller failure, perhaps more than one drive has failed and the array is dead. But if the array is functioning properly, there will be nothing except an alert, maybe a warning light and performance loss on the array.

First: DO NOT POWER OFF THE SERVER. This cannot be overstated. Power cycling a failed array is very traumatic for the hardware and is the most likely time to induce a further failure. With the rare exception of non-business class RAID, the RAID array is designed to let you replace the drive while the system is still running (hot swap) and this provides a significant amount of protection. (If your "server" does not have hot swap it fails to meet most industry assumptions about server class hardware and needs to be considered for an upgrade immediately following this recovery process.)

Second: Get the replacement drive ordered ASAP. Hopefully you have a hardware support agreement and the vendor will be there shortly. If not, hopefully you have the replacement drive on hand already. If not, get it ordered and get it shipped as quickly as possible. Time is of the essence.

Third: Determine if you have hot swap or not. This is crucial. Sadly, there is really no way to know for sure unless you know what your RAID situation is. Enterprise software RAID universally supports hot swap, as does hardware RAID. Both SAS and SATA support hot swap, but your entire hardware stack must support it. Anything widely considered server class will have hot swap capabilities, but some people opt to remove this. If you believe that someone has bypassed it on your server, you will need to find out for certain. 99% of people will have hot swap. You can identify that the server was built for hot swap by the fact that the drives are designed to be removed and are accessible from the outside of the chassis rather than from the inside only. Hot swap drives come in trays that slide out. All server class, NAS and SAN business class devices, even those in the $200 range, support hot swap.

Fourth: Determine if you have blind swap or not. Nearly all servers will have this so it is generally a foregone conclusion, but with the increasing popularity of bypassing hardware RAID for software RAID even when hardware RAID exists, there is a risk that the person who set up the server bypassed it. If you are using any business class hardware RAID controller (LSI, Adaptec, PERC, SmartArray, MegaRAID, etc.) you will have blind swap. All commodity servers are built with hardware RAID and blind swap as an enhancement to their hardware RAID offering. You just need to be aware if this has been bypassed. Unfortunately, there is no shortcut to knowing your environment. If you don't know for sure you must either guess or investigate.

Fifth: While you are waiting on the replacement drive to arrive, take a backup. In many cases, you are going to want to limit the usage of the array during this time. This is especially important if it is RAID 6 and absolutely critical if it is RAID 5. Get unnecessary users off, stop unnecessary processes, move workloads off, send people home early, whatever. But try to limit usage of the array as it is already stressed and in a dangerous state. If you are on RAID 10, this is not the same level of concern. But get a backup. You are in a risky position with any RAID array and this is the time most likely to experience data loss, so get a backup going while you wait. You don't have much else to do but sit and worry right now anyway.

Sixth: Replace the drive as soon as you can. There are rare cases where you will want to hold off on the replacement until the backups have finished or users are done for the day or whatever. Generally this is rare, especially in the SMB, and you will simply want to swap the bad drive for the good one immediately.

If this is blind swap, you just pull it out and replace it. For 95% of people, this is all that you need to do. Pull out the old, put in the new. Done.
If this is hot swap but without blind swap, you will need to go into the OS, disable the bad drive, then pull it out, put in the new, enable the new in the OS and tell it to rebuild.
If this is not hot swap, you need to identify the bad drive, power down the server, remove the bad drive, replace it with the new one, power up the server, then add the drive to the array and kick off the rebuild.
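
The three cases above can be sketched as a small decision function. This is only an illustration of the logic in this post; the function name and the step wording are made up for the example, and it assumes (as described above) that blind swap only exists on top of hot swap.

```python
from typing import List


def replacement_steps(hot_swap: bool, blind_swap: bool) -> List[str]:
    """Return the ordered steps for replacing a failed drive.

    Hypothetical helper for illustration only. Blind swap is treated as
    meaningful only when hot swap is also available.
    """
    if hot_swap and blind_swap:
        # The controller handles everything: pull the old, put in the new.
        return [
            "pull the failed drive",
            "insert the replacement drive",
        ]
    if hot_swap:
        # Hot swap without blind swap: the OS has to be told what is going on.
        return [
            "disable the failed drive in the OS",
            "pull the failed drive",
            "insert the replacement drive",
            "enable the new drive in the OS",
            "start the rebuild",
        ]
    # No hot swap: the only case where the server gets powered down.
    return [
        "identify the failed drive",
        "power down the server",
        "remove the failed drive",
        "insert the replacement drive",
        "power up the server",
        "add the new drive to the array",
        "start the rebuild",
    ]


for step in replacement_steps(hot_swap=True, blind_swap=False):
    print(step)
```

Note that only the last branch ever powers the server down, which matches the first rule above.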

Seventh: Wait. This process takes a while. Ideally, keep people from using the system while this is rebuilding but that is not always possible. With RAID 10, this matters little. With RAID 6, a bit. With RAID 5, a lot.

That's it. For the bulk of people a RAID replacement means nothing more than "look for the amber light, pull it out, put in a replacement." Only in rare cases where standard components have not been used does anything more need to be done.

scottalanmiller

How Long Will a RAID Rebuild Take?

This is tough to answer. It depends on the speed of your drives, size of the drives, size of the array, array type, controller, miscellaneous options and usage during the rebuild.

But typically it takes many hours. In a best case scenario, a RAID 1 or RAID 10 mirrored array will take as long as it takes for one drive to copy, block by block, sequentially to another identical drive. If these are 2TB drives, 2TB of block data will need to be copied.
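
As a rough back-of-the-envelope sketch of that best case: the time is just drive size divided by sustained copy speed. The 150 MB/s figure below is an assumption picked for the example, not a number from this thread, and real rebuilds on a busy array will be slower.

```python
def mirror_rebuild_hours(drive_tb: float, mb_per_sec: float) -> float:
    """Best-case mirror rebuild time: one full sequential drive copy.

    drive_tb   -- drive size in decimal terabytes (1 TB = 10**12 bytes)
    mb_per_sec -- assumed sustained copy speed in MB/s (1 MB = 10**6 bytes)
    """
    total_bytes = drive_tb * 1e12
    seconds = total_bytes / (mb_per_sec * 1e6)
    return seconds / 3600


# 2TB drive at an assumed 150 MB/s sustained, with no other load on the array
print(f"{mirror_rebuild_hours(2, 150):.1f} hours")  # prints "3.7 hours"
```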

In parity arrays (RAID 5, 6 and 7) this process can take a very long time because all drives in the array must be read, a complex set of calculations done and then the final drive written to. If data is being modified while this process is ongoing it can have huge impacts on the process. For moderately sized arrays it is not uncommon for this process to take days or even weeks and it is not unheard of or unobserved for the process on RAID 6 to take over a month.
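
To illustrate why every surviving drive has to be read, here is a minimal sketch of single-parity (RAID 5 style) reconstruction using XOR. It is a toy with one tiny stripe and made-up data, not how any particular controller is implemented, and RAID 6 and 7 use heavier math than plain XOR for their additional parity.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte position by byte position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe spread across three data drives (toy data for the example).
data = [b"AAAA", b"BBBB", b"CCCC"]

# The parity block is the XOR of all data blocks in the stripe.
parity = xor_blocks(data)

# Drive 1 fails. To rebuild its block we must read EVERY surviving block
# in the stripe, data and parity alike, and XOR them together.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)

print(rebuilt == data[1])  # prints "True"
```

Repeat that for every stripe on the array and you can see where the time goes.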

Dashrender

@scottalanmiller said:

How Long Will a RAID Rebuild Take?

For moderately sized arrays it is not uncommon for this process to take days or even weeks and it is not unheard of or unobserved for the process on RAID 6 to take over a month.

Wow, I realize that drives in single arrays don't fail all that often, but damn, a month? What is the likelihood of a second or even third drive failure in a situation where the resilver takes a month? The reduced performance and risk state in that situation really seems to make RAID 6 a bad choice, unless having access to the data just isn't that critical, and rebuilding the array from backups is almost considered the norm.

dafyre

@Dashrender said:

@scottalanmiller said:

How Long Will a RAID Rebuild Take?

For moderately sized arrays it is not uncommon for this process to take days or even weeks and it is not unheard of or unobserved for the process on RAID 6 to take over a month.

Wow, I realize that drives in single arrays don't fail all that often, but damn, a month? What is the likelihood of a second or even third drive failure in a situation where the resilver takes a month? The reduced performance and risk state in that situation really seems to make RAID 6 a bad choice, unless having access to the data just isn't that critical, and rebuilding the array from backups is almost considered the norm.

A month is not unreasonable. I've seen a 7TB RAID 5 setup take a week to rebuild.

scottalanmiller

@Dashrender said:

@scottalanmiller said:

How Long Will a RAID Rebuild Take?

For moderately sized arrays it is not uncommon for this process to take days or even weeks and it is not unheard of or unobserved for the process on RAID 6 to take over a month.

Wow, I realize that drives in single arrays don't fail all that often, but damn, a month? What is the likelihood of a second or even third drive failure in a situation where the resilver takes a month? The reduced performance and risk state in that situation really seems to make RAID 6 a bad choice, unless having access to the data just isn't that critical, and rebuilding the array from backups is almost considered the norm.

Secondary drive failure starts to become incredibly common under a month of parity resilver stress. This is just one of so many reasons why parity RAID is rarely good for production workloads.

scottalanmiller

@dafyre said:

A month is not unreasonable. I've seen a 7TB RAID 5 setup take a week to rebuild.

And that's a tiny array by modern standards! Just move that to RAID 6 and you likely add 20% or more time for that rebuild because there is so much more math involved. And lots of modern arrays are 14TB or bigger. That same array on 14TB would be a fortnight. Put that on RAID 6, which would be absolutely necessary for an array so large, and you are up to three weeks almost certainly.

Now when a lot of these shops are talking 30TB+ and always RAID 6 or RAID 7, you start to see where a month is easy to hit. I've seen some that we estimated at far longer.

And if people are using the drive during the day, you can easily lose 30% of your rebuild time from the array being under load.
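
The arithmetic in this post can be sketched as follows. It assumes rebuild time scales linearly with array size from the 7TB-in-a-week data point, which is a simplification, and it reads "lose 30%" as the rebuild taking roughly 30% longer under load, which is my interpretation. The function and parameter names are made up for the example.

```python
def estimate_rebuild_days(array_tb, baseline_tb=7.0, baseline_days=7.0,
                          raid6_penalty=0.0, load_penalty=0.0):
    """Scale an observed rebuild time to a different array size.

    Assumes linear scaling with capacity (a simplification) and treats the
    RAID 6 and daytime-load penalties as simple multipliers.
    """
    days = baseline_days * (array_tb / baseline_tb)
    return days * (1 + raid6_penalty) * (1 + load_penalty)


# 14TB on RAID 5: the "fortnight" figure
print(f"{estimate_rebuild_days(14):.1f}")                                         # prints "14.0"
# Same array on RAID 6, about 20% more
print(f"{estimate_rebuild_days(14, raid6_penalty=0.2):.1f}")                      # prints "16.8"
# RAID 6 and in use during the day
print(f"{estimate_rebuild_days(14, raid6_penalty=0.2, load_penalty=0.3):.1f}")    # prints "21.8"
# 30TB on RAID 6: past a month before any load is counted
print(f"{estimate_rebuild_days(30, raid6_penalty=0.2):.1f}")                      # prints "36.0"
```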

Dashrender

Again, at that point, is it worth the risk of resilvering? Seems almost better to ensure you have a good backup, then wipe and restore - hell, it will probably take less time.

scottalanmiller

@Dashrender said:

Again, at that point, is it worth the risk of resilvering? Seems almost better to ensure you have a good backup, then wipe and restore - hell, it will probably take less time.

If you believe that to be the case, you should probably not have had a RAID array in the first place. It would be pretty rare that you made a RAID array with the intent of not recovering from a drive failure.

Of course, the array does, we presume, afford an opportunity to take a last minute backup before starting over with a fresh array, but that approach depends on getting that backup.

But it is almost better to take rapid backups and use RAID 0 if we don't intend to ever do the drive replacement on parity RAID.

BRRABill

One thing that always confuses me (and perhaps this is just with DELL servers) is that if the drive isn't actually failed yet (but in predictive failure status) you have to (or are supposed to) log in and take the drive offline first.

I know you were talking about failed drives, but I think this is also worth mentioning under the same context.

scottalanmiller

@BRRABill said:

One thing that always confuses me (and perhaps this is just with DELL servers) is that if the drive isn't actually failed yet (but in predictive failure status) you have to (or are supposed to) log in and take the drive offline first.

Supposed to, because you are preventing the risks that come with it actually failing. When a drive actually fails, the controller offlines it. When it is predictive, it does not. So the drive is alive and spinning if you pull it. It should work, but why add risk? Offline it and be sure that it is spun down and not being used when it gets yanked.

BRRABill

@scottalanmiller said:

Supposed to, because you are preventing the risks that come with it actually failing. When a drive actually fails, the controller offlines it. When it is predictive, it does not. So the drive is alive and spinning if you pull it. It should work, but why add risk? Offline it and be sure that it is spun down and not being used when it gets yanked.

Right. You know me ... always thinking of the NOOB questions. (WWBA? What Would BRRABill Ask? )

marcinozga

I've had 12TB RAID 6 (18TB raw), with 10TB of data on it rebuilt in about 24h. Software RAID shines here.

scottalanmiller

@marcinozga said:

I've had 12TB RAID 6 (18TB raw), with 10TB of data on it rebuilt in about 24h. Software RAID shines here.

What drives and software RAID implementation are you using?

And yes, software RAID can divert system resources to the computations making it much faster. People really don't realize just how much faster software RAID is than hardware RAID.

marcinozga

@scottalanmiller said:

@marcinozga said:

I've had 12TB RAID 6 (18TB raw), with 10TB of data on it rebuilt in about 24h. Software RAID shines here.

What drives and software RAID implementation are you using?

And yes, software RAID can divert system resources to the computations making it much faster. People really don't realize just how much faster software RAID is than hardware RAID.

ZFS on FreeBSD and WD Red 3TB. And quad core Xeon w/HT.

MattSpeller

I usually just power off the server, yank all the drives out, mix em all up and put em back in. I also enjoy testing my backups quite often.

I'll have to try your method next time!