Hot Swap vs. Blind Swap
-
@BRRABill said:
Can you pull a hotplug drive out at any time, or is that dependent on server manufacturer and RAID card?
Depends what you are asking.
The hardware determines whether pulling a hot plug drive will cause a short. Hot plug hardware lets you yank a drive at random without causing an electrical issue.
Hot plug software lets you tell the OS that a drive has been removed, and then has the OS add the new one when you replace it.
Blind swap allows you to just walk up to an array, pull a drive without preparing anything, and put a new one in without needing to tell the system what you have done.
-
I ask because I had an issue yesterday on our DELL server. Which, admittedly, is very, very old. Experienced, I should say. No one likes to be called old.
It's our main data server. One of two servers that really matter.
We have 4 drives in a RAID5 array. (This is from the dark ages when that was considered OK.)
I went into the server room for something else, and noticed one of the drives was blinking amber. I go from a 1 to a 5 on the 1 to 10 anxiety scale because that kind of stuff always makes me nervous. Anyway, no problem, I have spare drives on the shelf ready to go. I pull out the old drive. No problem. I put in the new drive, no problem. I go to log in to start rebuilding the array, and I notice that the server is rebooting. Hmm, that's odd. I look at the drive. Now TWO of the four are blinking amber. I've now gone to a 10, LOL.
Turns out a second drive failed after I did the hot plug. I'm not sure if it was just random (which seems unlikely) or something weird happened during the hot plug.
I spent a long, long time getting everything back to how it was.
-
RAID 5 induces other failures when you go to rebuild. It's extremely common and just an artifact of that RAID level. Doesn't mean that it will always do it or even normally do it, but it is very common. Once you do a drive swap it immediately increases the load on the drives and makes them more likely to fail.
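To see why a swap loads up the remaining drives, here is a minimal sketch of single-parity reconstruction. It assumes a plain XOR parity stripe, which is how RAID 5 parity works in general; the block contents, the drive index, and the helper name xor_blocks are made up for illustration, and a real controller rotates parity across drives and works on much larger stripes.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Hypothetical stripe on a 4-drive array: three data blocks plus one parity block.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)
stripe = data + [parity]

# Drive 1 is pulled. Its block can only be recomputed by reading
# the matching block from every surviving drive.
failed = 1
survivors = [block for i, block in enumerate(stripe) if i != failed]
rebuilt = xor_blocks(survivors)
reads_needed = len(survivors)

print(rebuilt == stripe[failed], reads_needed)
```

The point of the sketch is the read pattern: every stripe on the replaced drive needs a read from all of the other drives, so a rebuild reads each surviving drive end to end.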
-
Interesting. The second failed drive definitely sounded like it was dead...mechanical issue.
I think that happened to me a long time ago on a server, which is why I'm always nervous doing it.
THOUGH thanks to ML I'll never have another RAID 5 array, so no need to worry!
It doesn't do that for any other RAID level?
And I am assuming RAID 5 of SSDs wouldn't do that?
-
@BRRABill SSDs do suffer from mechanically induced failed like Winchester drives.
-
RAID 6 induces even more immediate wear and tear, so it is even more likely to kill off a second drive at the time of drive replacement, PLUS it has one extra drive that can fail. But it can withstand losing one additional drive, so it is dramatically safer overall.
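A rough way to put numbers on that trade-off, as a sketch only: treat each surviving drive as failing independently during the rebuild with some probability, and ask how likely it is that more drives fail than the array can tolerate. The function name p_array_loss and both failure probabilities below are invented for illustration, not measured values, and real drive failures are not independent.

```python
from math import comb

def p_array_loss(survivors, tolerated, p_fail):
    """Chance that more than `tolerated` of the surviving drives fail
    during a rebuild, assuming independent failures at p_fail each."""
    p_ok = sum(
        comb(survivors, k) * p_fail**k * (1 - p_fail) ** (survivors - k)
        for k in range(tolerated + 1)
    )
    return 1 - p_ok

# Hypothetical 4-drive RAID 5: 3 survivors, no further failure tolerated.
raid5 = p_array_loss(survivors=3, tolerated=0, p_fail=0.05)

# Hypothetical 5-drive RAID 6 with the same usable capacity: 4 survivors,
# a higher assumed per-drive failure chance, one further failure tolerated.
raid6 = p_array_loss(survivors=4, tolerated=1, p_fail=0.08)

print(f"RAID 5 loss chance: {raid5:.4f}")
print(f"RAID 6 loss chance: {raid6:.4f}")
```

Even with the extra drive and a higher assumed per-drive failure chance, the RAID 6 figure comes out lower under these assumptions, because losing the array takes two more failures instead of one.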
-
@scottalanmiller said:
@BRRABill SSDs do suffer from mechanically induced failed like Winchester drives.
Is the rate the same? Or is this a random (but common) thing?
-
@BRRABill said:
@scottalanmiller said:
@BRRABill SSDs do suffer from mechanically induced failed like Winchester drives.
Is the rate the same? Or is this a random (but common) thing?
Sorry that was a typo. SSDs do NOT suffer mechanically induced failure.
-
Oh. Phew.
What is the point of RAID if that happens?
That's it. I'm quitting IT.
I've had enough.
-
@BRRABill not much point to RAID 5, that's what we've been saying for years. By 2009 it was so dangerous that it was actually worse in most cases than doing nothing at all.
-
Well this server is from well before 2009.
It's a miracle nothing has happened yet.
-
That is indeed pretty old.
-
I'm not going to say exactly HOW old because I'm not sure I can take any more heads shaking at me this month. LOL.
-
@BRRABill said:
I go from a 1 to a 5 on the 1 to 10 anxiety scale because that kind of stuff always makes me nervous. Anyway, no problem, I have spare drives on the shelf ready to go. I pull out the old drive.
In complete honesty I will admit that one time I was cold swapping a failed drive in a ProLiant DL360 G5 and replaced the wrong one. Fortunately the server wouldn't even boot and I was able to power it down, sort it out and bring it back up. Since then I will never run a server without the backplane kit and hot swappable drive caddies with the status indicator LED.
-
@BRRABill Sounds like a situation I had to deal with last year where an organization was running Dell PowerEdge 2950 Gen II pizza boxes. I tried reasoning with them, explaining that 9-year-old servers should not be production machines for mission-critical systems. They didn't seem to care about business continuity until the servers started failing.
-
@drewlander said:
@BRRABill Sounds like a situation I had to deal with last year where an organization was running Dell PowerEdge 2950 Gen II pizza boxes. I tried reasoning with them, explaining that 9-year-old servers should not be production machines for mission-critical systems. They didn't seem to care about business continuity until the servers started failing.
This was a PowerEdge 2800. I've been kind of proud of the fact that I kept these things up and running for so long. And considering the low RAM and age, they still run awesome.
BUT ... like I said it's a miracle that things haven't gone south quicker. The second drive that failed was a replacement drive, which of course was not new.
Key point, as in anything, is to always have a good backup.
-
We once had a set of Compaq Proliant 800s that made it a decade without failing. They were all retired effectively still healthy - just old and worthless.
-
@scottalanmiller said:
We once had a set of Compaq Proliant 800s that made it a decade without failing. They were all retired effectively still healthy - just old and worthless.
That's about where we are. I've hung lucky mementos in there, and am hoping for the best.
I actually have a construction paper good luck charm that a vendor's wife gave me a long time ago (before these servers even), and it's hanging in there. It has done its job pretty well so far.
-
True story. Right after I posted that last post, I went into the server room to take a picture of this paper good luck charm. On the way back down the hall, the building's power went out, and has been out the past 3 hours. This week is just AWESOME!
Anyway, here is the picture:
Note the failed DELL right below it.
It did its job for many years, though. No complaints.
-
P.S. If anyone can read that, and it DOESN'T say good luck, please don't let me know.