Invalid Drive Movement from HP SmartArray P411 RAID Controller with StorageWorks MSA60
-
@scottalanmiller said in "Invalid Drive Movement" (HP Smart Array P411):
@Shuey said in "Invalid Drive Movement" (HP Smart Array P411):
We don't have a support contract on this server or the attached MSA, and they're likely way out of warranty (ProLiant DL360 G8 and a StorageWorks MSA60), so I'm not sure how much we'd have to spend in order to get HP to "help" us :-S...
A bit. Why is there an MSA out of contract? The only benefit to an MSA is the support contract. Not that that makes it worth it, but proprietary storage requires a warranty contract to be viable. The rule is that any storage of that nature needs to be decommissioned the day before the support contract runs out, because there isn't necessarily any path to recovery in the event of an "incident" without one. It's not a standard server that you can just fix yourself with third-party parts. Sometimes you can, but as it is a closed, proprietary system, you are generally totally dependent on your support contract from the vendor to keep it working.
There is a good chance that this is a "replace the MSA and restore from backup" situation in that case.
Unfortunately, my company's philosophy on "investing in IT infrastructure" goes like this: "We'll spend hundreds to thousands of dollars every time our PACS vendor tells us they need it. Then, when they say that they need to upgrade their equipment, we'll re-purpose their old stuff for the rest of our production environment (because we don't understand the importance of spending money on the rest of our infrastructure, and we don't trust the knowledgeable people we hired in our IT department)"
-
@Shuey said in "Invalid Drive Movement" (HP Smart Array P411):
@scottalanmiller said in "Invalid Drive Movement" (HP Smart Array P411):
@Shuey said in "Invalid Drive Movement" (HP Smart Array P411):
@scottalanmiller said in "Invalid Drive Movement" (HP Smart Array P411):
@Shuey said in "Invalid Drive Movement" (HP Smart Array P411):
I actually rebooted this server multiple times about a month ago when I installed updates on it. The reboots went fine. We also completely powered that server down at around the same time because I added more RAM to it. Again, after powering everything back on, the server and raid array information was all intact.
Does your normal reboot schedule for your servers include a reboot of the MSA? Could it be that they were powered back on in the incorrect order? MSAs are notoriously flaky; that is likely where the issue is.
I'd call HPE support. The MSA is a flaky unit but HPE support is quite good.
We unfortunately don't have a "normal reboot schedule" of ANY kind for our servers :-/...
I should not have said schedule. I should have said your "Normal reboot process." Regardless of the regularity of the reboots, is the process a standard one?
I'm not sure we have a "standard"... we only reboot this particular ESXi host when absolutely necessary, and this weekend is possibly the first time we've rebooted the MSA in a year or more :-S...
For the future, sadly it is too late now, but consider these things...
- A monthly reboot, at the least, of everything, not just some components, lets you test that things are really working, and at a time when you can best fix them.
- Avoid devices like the MSA in general, they add a lot of risk fundamentally.
- Avoid any proprietary "black box" system that is out of support. While these systems can be good when under support, the moment they are out of support their value hits literally zero. They are effectively bricks. Would you consider running the business on a junk consumer QNAP device? This device, when out of support, is far worse.
-
@Shuey said in "Invalid Drive Movement" (HP Smart Array P411):
Unfortunately, my company's philosophy on "investing in IT infrastructure" goes like this: "We'll spend hundreds to thousands of dollars every time our PACS vendor tells us they need it. Then, when they say that they need to upgrade their equipment, we'll re-purpose their old stuff for the rest of our production environment (because we don't understand the importance of spending money on the rest of our infrastructure, and we don't trust the knowledgeable people we hired in our IT department)"
Simply explain that an unsupported MSA is a dead device, totally useless. When asked to use it, explain that it's not even something you'd play around with at home.
Even a brand new, supported MSA falls below my home line. But once out of support, it's below any home line.
-
What have you been trying thus far? What's your current triage strategy assuming that we can't fix this?
-
Edited to add tags and upgrade the title for SEO and rapid visual determination.
-
You guys are not going to believe this...
First I attempted a fresh cold boot of the existing MSA, waited a couple of minutes, then powered up the ESXi host, but the issue remained. I then shut down the host and the MSA, moved the drives into our spare MSA, powered it up, waited a couple of minutes, then powered up the ESXi host; the issue still remained.
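For anyone else who hits the same "Invalid Drive Movement" message, it may be worth checking what the controller itself reports before rebuilding anything. This is a minimal sketch, assuming the HPE Smart Storage CLI (hpssacli, or hpacucli on older bundles) is installed in the ESXi shell and that the P411 sits in slot 1; both of those are assumptions, not something confirmed on this host:

```
# List every controller, array, and logical drive the P411 currently reports
/opt/hp/hpssacli/bin/hpssacli ctrl all show config

# Status of all logical drives on the controller in slot 1 (slot number is a guess;
# take the real one from the output of "ctrl all show")
/opt/hp/hpssacli/bin/hpssacli ctrl slot=1 ld all show status

# Physical drive view, useful after moving disks between enclosures
/opt/hp/hpssacli/bin/hpssacli ctrl slot=1 pd all show status
```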
At that point, I figured I was pretty much screwed; nothing during the RAID controller's initialization gave me an option to re-enable a failed logical drive. So I booted into the RAID config, verified again that there were no logical drives present, and created a new logical drive (RAID 1+0 with two spare drives, the same as we did about 2 years ago when we first set up this host and storage).
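For what it's worth, the re-enable option that never shows up in the boot-time config does exist in the Smart Array CLI on some controller and firmware combinations. This is a hedged sketch only, assuming hpssacli is available, the controller is in slot 1, and the failed logical drive is number 1 (all assumptions); it forces a failed logical drive back online with no guarantee the data on it is intact, so it is strictly a last resort before recreating the array:

```
# Force a failed logical drive back to an OK state (data integrity is not guaranteed)
/opt/hp/hpssacli/bin/hpssacli ctrl slot=1 ld 1 modify reenable forced
```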
Then I let the server boot back into vSphere and accessed it via vCenter. The first thing I did was remove the host from inventory, then re-add it (I was hoping to clear all the inaccessible guest VMs this way, but it didn't clear them from the inventory). Once the host was back in my inventory, I removed each of the guest VMs one at a time.
Once the inventory was cleared, I verified that no datastore existed and that the disks were basically sitting there ready as "data disks". So I went ahead and created a new datastore (again, the same as we did a couple of years ago, using VMFS). I was eventually prompted to specify a mount option and had the choice to "keep the existing signature". At that point, I figured it was worth a shot to keep the signature; if things didn't work out, I could always blow it away and re-create the datastore.
After I finished building the datastore with the keep-signature option, I tried navigating to the datastore to see if anything was in it, and it appeared empty. Just out of curiosity, I SSH'd to the host and checked from there, and to my surprise I could see all my old data and all my old guest VMs! I went back into vCenter, re-scanned storage, refreshed the console, and all of our old guest VMs were there! I re-registered each VM and was able to recover everything. All of our guest VMs are back up and successfully communicating on the network.
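For anyone who needs to repeat this recovery, roughly the same "keep the existing signature" path can be walked from the ESXi shell instead of the vSphere wizard. A sketch under a few assumptions: esxcli on an ESXi 5.x/6.x era host, and the datastore label and .vmx path below are placeholders, not the real names from this environment:

```
# Rescan all HBAs so the host picks up the rebuilt logical drive
esxcli storage core adapter rescan --all

# List VMFS volumes the host treats as snapshots/replicas; this is the CLI
# equivalent of being offered the "keep the existing signature" choice
esxcli storage vmfs snapshot list

# Mount one of them without resignaturing, by its volume label (placeholder)
esxcli storage vmfs snapshot mount -l "datastore1"

# Confirm the old folders survived
ls /vmfs/volumes/datastore1/

# Re-register a guest VM by its .vmx path (placeholder); repeat for each VM
vim-cmd solo/registervm /vmfs/volumes/datastore1/SomeVM/SomeVM.vmx
```

Mounting with the existing signature only works if the new logical drive laid the data out exactly as before, which is presumably why recreating the array with the identical RAID 1+0 layout mattered here.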
I think most people in the IT community would agree that the chances of something like this happening are somewhere between extremely low and impossible.
As far as I'm concerned, this was a miracle of God...
-
That is seriously amazing!
-
Next step... get local drives and decom that MSA60. It just sent a shot across your bow and has exposed how dangerous and precarious it is. Don't fail to heed its warning.
-
@scottalanmiller said in Invalid Drive Movement from HP SmartArray P411 RAID Controller with StorageWorks MSA60:
Next step... get local drives and decom that MSA60. It just sent a shot across your bow and has exposed how dangerous and precarious it is. Don't fail to heed its warning.
Absolutely Scott! I'm gonna be talking more with my boss about this as soon as possible!
-
Thanks again to everyone who replied and gave feedback on this. It's great to know that there's a solid community of knowledgeable people who are willing to share their expertise - I really appreciate it!!
-
@Shuey said in Invalid Drive Movement from HP SmartArray P411 RAID Controller with StorageWorks MSA60:
Thanks again to everyone who replied and gave feedback on this. It's great to know that there's a solid community of knowledgeable people who are willing to share their expertise - I really appreciate it!!
Sadly we didn't find your solution. But happily you found it on your own!