Hot Swap vs. Blind Swap
-
There are three different concepts of drive swapping: cold, hot and blind (or blind hot).
Cold Swapping: No enterprise or business class server would ever require cold swapping of hard drives, although a shocking number of IT pros assume that cold swapping is required or recommended. Cold swapping means that the entire server must be powered down before a failed drive can be replaced. This defeats much of the value of RAID and is mostly a holdover from consumer desktop hardware.
Hot Swapping: The ability to replace a hard drive while the server is still running. This allows for zero downtime drive replacement so, in theory, an array could be replaced, drive by drive, many times over without the server ever needing to be powered down. Any enterprise class or business class server will be hot swap by definition. Lacking this feature would disqualify a device from being considered business ready as a server.
Blind Swapping: Generally a feature unique to hardware RAID systems. This is an extension of hot swapping that adds the ability to swap a drive without needing to interact with the operating system first; hot swapping alone does not mean that no OS interaction is needed. Blind swapping is popular in large datacenters so that datacenter staff who do not have access to the operating system can replace failed drives without any involvement from the systems administrators.
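To make the "interact with the operating system first" distinction concrete, here is a minimal sketch of what that interaction typically looks like on a Linux box with software-managed disks. The /dev/sdb device and host0 SCSI host names are placeholders only; a blind-swap hardware RAID controller handles all of this in firmware with no OS involvement at all.

```python
# Minimal sketch of the OS-side steps that plain hot swap still requires
# on Linux (e.g. with mdadm or ZFS managing the disks).
# "sdb" and "host0" are hypothetical names for illustration.
from pathlib import Path

FAILED_DISK = "sdb"      # hypothetical device name of the failed drive
SCSI_HOST = "host0"      # hypothetical SCSI host the drive hangs off

def tell_os_drive_is_gone(disk: str) -> None:
    """Ask the kernel to offline and remove the device before pulling it."""
    Path(f"/sys/block/{disk}/device/delete").write_text("1")

def tell_os_to_find_new_drive(host: str) -> None:
    """Rescan the SCSI host so the kernel discovers the replacement drive."""
    # "- - -" means wildcard channel, target and LUN.
    Path(f"/sys/class/scsi_host/{host}/scan").write_text("- - -")

if __name__ == "__main__":
    tell_os_drive_is_gone(FAILED_DISK)    # step 1: before physically pulling the drive
    # ... physically swap the drive here ...
    tell_os_to_find_new_drive(SCSI_HOST)  # step 2: after seating the new one
```

With blind swap, the datacenter tech skips both steps entirely; the controller notices the removal and insertion on its own.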
-
Can you pull a hotplug drive out at any time, or is that dependent on server manufacturer and RAID card?
-
@BRRABill said:
Can you pull a hotplug drive out at any time, or is that dependent on server manufacturer and RAID card?
Depends what you are asking.
The hardware determines whether pulling a hot plug drive will cause a short. Hot plug hardware allows a random drive yank without causing an electrical issue.
Hot plug software allows you to tell the OS that a drive has been removed and then have it pick the drive back up when you replace it.
Blind swap allows you to just walk up to an array, pull a drive without preparing anything and put a new one in, without needing to tell the system what you have done.
-
I ask because I had an issue yesterday on our DELL server. Which, admittedly, is very, very old. Experienced, I should say. No one likes to be called old.
It's our main data server. One of two servers that really matter.
We have 4 drives in a RAID5 array. (This is from the dark ages when that was considered OK.)
I went into the server room for something else, and noticed one of the drives was blinking amber. I go from a 1 to a 5 on the 1 to 10 anxiety scale because that kind of stuff always makes me nervous. Anyway, no problem, I have spare drives on the shelf ready to go. I pull out the old drive. No problem. I put in the new drive, no problem. I go to log in to start rebuilding the array, and I notice that the server is rebooting. Hmm, that's odd. I look at the drive. Now TWO of the four are blinking amber. I've now gone to a 10, LOL.
Turns out a second drive failed after I did the hot plug. I'm not sure if it was just random (which seems unlikely) or something weird happened during the hot plug.
I spent a long, long time getting everything back to how it was.
-
RAID 5 tends to induce additional failures when you go to rebuild. It's extremely common and just an artifact of that RAID level. That doesn't mean it will always happen, or even usually happen, but it is very common. Once you swap a drive, the rebuild has to read every remaining drive end to end, which immediately increases the load on those drives and makes them more likely to fail.
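As a rough back-of-the-envelope sketch of how punishing that full-array read is: the 2 TB drive size and the 1-in-10^14-bits unrecoverable read error rate below are illustrative assumptions (a typical consumer drive spec), not the specs of the array in this thread.

```python
# Rough sketch of why RAID 5 rebuilds so often surface a second failure.
# Drive size and URE rate are assumptions for illustration only.
DRIVES = 4               # 4-drive RAID 5, as in the thread
DRIVE_SIZE_TB = 2        # assumed capacity per drive
URE_RATE = 1e-14         # typical consumer spec: one error per 1e14 bits read

# Every surviving drive must be read end to end to rebuild the lost one.
bits_read_during_rebuild = (DRIVES - 1) * DRIVE_SIZE_TB * 8e12
p_clean_rebuild = (1 - URE_RATE) ** bits_read_during_rebuild

print(f"Bits read during rebuild: {bits_read_during_rebuild:.2e}")
print(f"Chance of at least one unrecoverable read error: {1 - p_clean_rebuild:.1%}")
```

Under those assumptions the rebuild reads roughly 4.8e13 bits and has about a one-in-three chance of tripping over a latent read error, before even counting the extra mechanical stress on already-aged drives.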
-
Interesting. The second drive that failed definitely sounded like it was dead... a mechanical issue.
I think that happened to me a long time ago on a server, which is why I'm always nervous doing it.
THOUGH thanks to ML I'll never have another RAID 5 array, so no need to worry!
It doesn't do that for any other RAID level?
And I am assuming RAID 5 of SSDs wouldn't do that?
-
@BRRABill SSDs do suffer from mechanically induced failure like Winchester drives.
-
RAID 6 induces even more wear and tear at rebuild time, so it is even more likely to kill off a second drive at the time of drive replacement, plus it has one extra drive that could fail. But because it can withstand losing one additional drive, it is dramatically safer overall.
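Extending the rough sketch above with the same illustrative assumptions: RAID 6 reads one more drive during the rebuild, so the chance of hitting a read error is actually a bit higher, but a single read error (or even the loss of a whole second drive) during the rebuild is recoverable from the second parity rather than fatal.

```python
# Continuing the earlier rough sketch; same illustrative assumptions.
DRIVE_SIZE_TB = 2
URE_RATE = 1e-14

def p_read_error(surviving_drives: int) -> float:
    """Chance of at least one URE while reading every surviving drive in full."""
    bits = surviving_drives * DRIVE_SIZE_TB * 8e12
    return 1 - (1 - URE_RATE) ** bits

# 4-drive RAID 5 rebuilding after one loss: any URE kills the array.
print(f"RAID 5 (3 drives to read): {p_read_error(3):.1%} chance of a fatal URE")
# 5-drive RAID 6 rebuilding after one loss: a URE is repaired from the second parity.
print(f"RAID 6 (4 drives to read): {p_read_error(4):.1%} chance of a URE, but it is recoverable")
```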
-
@scottalanmiller said:
@BRRABill SSDs do suffer from mechanically induced failure like Winchester drives.
Is the rate the same? Or is this a random (but common) thing?
-
@BRRABill said:
@scottalanmiller said:
@BRRABill SSDs do suffer from mechanically induced failure like Winchester drives.
Is the rate the same? Or is this a random (but common) thing?
Sorry that was a typo. SSDs do NOT suffer mechanically induced failure.
-
Oh. Phew.
What is the point of RAID if that happens?
That's it. I'm quitting IT.
I've had enough.
-
@BRRABill not much point to RAID 5, that's what we've been saying for years. By 2009 it was so dangerous that it was actually worse in most cases than doing nothing at all.
-
Well this server is from well before 2009.
It's a miracle nothing has happened yet.
-
That is indeed pretty old.
-
I'm not going to say exactly HOW old because I'm not sure I can take any more heads shaking at me this month. LOL.
-
@BRRABill said:
I go from a 1 to a 5 on the 1 to 10 anxiety scale because that kind of stuff always makes me nervous. Anyway, no problem, I have spare drives on the shelf ready to go. I pull out the old drive. No problem.
In complete honesty I will admit that one time I was cold swapping a failed drive in a ProLiant DL360 G5 and replaced the wrong one. Fortunately the server wouldn't even boot and I was able to power it down, sort it out and bring it back up. Since then I will never run a server without the backplane kit and hot-swappable drive caddies with status indicator LEDs.
-
@BRRABill Sounds like a situation I had to deal with last year where an organization was running Dell PowerEdge 2950 Gen II pizza boxes. I tried reasoning with them explaining that 9 year old servers should not be production machines for mission critical systems. They didn't seem to care about business continuity until they started failing.
-
@drewlander said:
@BRRABill Sounds like a situation I had to deal with last year where an organization was running Dell PowerEdge 2950 Gen II pizza boxes. I tried reasoning with them explaining that 9 year old servers should not be production machines for mission critical systems. They didn't seem to care about business continuity until they started failing.
This was a PowerEdge 2800. I've been kind of proud of the fact that I kept these things up and running for so long. And considering the low RAM and age, they still run awesome.
BUT ... like I said it's a miracle that things haven't gone south quicker. The second drive that failed was a replacement drive, which of course was not new.
Key point, as in anything, is to always have a good backup.
-
We once had a set of Compaq ProLiant 800s that made it a decade without failing. They were all retired while effectively still healthy - just old and worthless.
-
@scottalanmiller said:
We once had a set of Compaq ProLiant 800s that made it a decade without failing. They were all retired while effectively still healthy - just old and worthless.
That's about where we are. I've hung lucky mementos in there, and am hoping for the best.
I actually have a construction paper good luck charm a vendor's wife gave me a long time ago (before these servers, even) that's still hanging in there. It has done its job pretty well so far.