Strange Smart Array P410i problem
-
Hello!
I have a DL380 G7 file server running FreeNAS 9.2. The shared data (CIFS, NFS) lives on a RAID 6 made up of 6 x 3 TB MDL disks attached to a Smart Array P410i. Last week the array became really slow. I had to shut down multiple VMs and move them off the server so that my users could get any work done, and I had to move the VMs overnight, several days in a row. The read rate of the array was about 3 MB/s, with all green lights blinking on the front of the machine. I was really puzzled and wondered whether the battery for the cache had stopped working.
I have had no problems with this setup before (it has been running for at least 10 months), but I discovered now that FreeNAS/FreeBSD is not well supported by HP (or the other way around), so I could not get any info from inside FreeNAS. There is no software to inspect the status of the RAID. Once all VMs and critical data had been moved off the server, I could finally reboot it and run the HP SmartStart CD. I ran a short diagnostic test, and... one of the drives was marked as Failed due to too many read/write errors. Now, that explains it all. I replaced the disk, waited 12 (!) hours for the rebuild to finish, and now read/write is up to 250-400 MB/s. All is well.
So, the question remains (and I have sent this to my HP dealer):
Why was the drive not clearly marked as bad (via the LEDs) and kicked from the RAID? Has anyone ever seen this behaviour?
-
Welcome to the MangoLassi community! Always nice to see new faces.
-
Thank you!
I just registered after finding this site when searching for an article you have written... I was a little surprised seeing it was so easy to register. Nice!
-
Welcome, flomer! I do agree that this seems weird!
-
A co-worker reasoned that the RAID had been so slow because, whenever data was read from or written to the RAID, that one bad drive would have a hard time reading or writing some area (or several) that had errors, and that this would lead to the machine retrying over and over...? But I figured that in such circumstances the controller would kick the drive out... Perhaps this controller is a perfect match for WD Greens.
-
@flomer said:
A co-worker reasoned that the RAID had been so slow because, whenever data was read from or written to the RAID, that one bad drive would have a hard time reading or writing some area (or several) that had errors, and that this would lead to the machine retrying over and over...? But I figured that in such circumstances the controller would kick the drive out... Perhaps this controller is a perfect match for WD Greens.
Well, if the drive was having issues, but not ones that would make it fail outright, it might have been marked as healthy while still slowing down. It's possible. Any one slow drive will definitely kill the performance of a RAID array, as all the other drives have to wait on that one for each read or write operation. That it slowed things down isn't surprising; that's not uncommon ahead of an actual failure.
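To put a rough number on that, here is a minimal sketch of a stripe operation in which the array has to wait for its slowest member. The latencies and the `stripe_read_ms` helper are made up for illustration; they are not measurements from a P410i.

```python
# Toy model: a full-stripe operation completes only when the slowest
# member drive has answered, so stripe latency is the max of the
# per-drive latencies. All numbers below are illustrative assumptions.

def stripe_read_ms(drive_latencies_ms):
    """Return the time for one stripe operation across all member drives."""
    return max(drive_latencies_ms)

healthy = [8, 9, 8, 10, 9, 8]      # six healthy drives, roughly 8-10 ms each
degraded = [8, 9, 8, 10, 9, 900]   # one drive retrying internally for ~900 ms

print(stripe_read_ms(healthy))     # 10
print(stripe_read_ms(degraded))    # 900
```

Five fast drives don't help: in this model the one struggling drive sets the pace for every operation that touches it.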
-
Twelve hours for a RAID 6 resilver is actually pretty good. I've seen them top a month.
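For a sense of scale, a back-of-the-envelope calculation of the average rebuild rate, assuming the controller rewrote the entire 3 TB replacement disk in the 12 hours reported (both figures come from the first post; the helper name is just for illustration):

```python
# Rough average rebuild rate, assuming the whole 3 TB replacement disk
# (decimal terabytes) was rewritten during the 12-hour rebuild.

DISK_BYTES = 3 * 10**12   # 3 TB drive, as described in the first post
REBUILD_HOURS = 12        # rebuild time reported in the first post

def rebuild_rate_mb_s(disk_bytes, hours):
    """Average sustained rate in MB/s (decimal megabytes)."""
    return disk_bytes / (hours * 3600) / 10**6

print(round(rebuild_rate_mb_s(DISK_BYTES, REBUILD_HOURS), 1))  # 69.4
```

That is roughly 69 MB/s sustained per disk, which is respectable for a rebuild.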
-
FreeBSD is a great OS, but if hpacucli is not available for it, then it sucks a bit running it on the bare metal. If you use a hypervisor like ESXi, Hyper-V or XenServer, you can get around that, as the errors go to the hypervisor instead of to a guest OS. Then you can virtualize FreeBSD on top of that without any problems.
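On a platform where hpacucli does run, a small script can flag any drive that is not reported as OK. This is only a sketch: the sample text mimics the shape I believe a physical-drive status query such as `hpacucli ctrl slot=0 pd all show status` prints, but treat the exact format, the slot number, and the `unhealthy_drives` helper as assumptions and check them against your own output.

```python
import re

# Sample text in the shape hpacucli is assumed to print for a
# physical-drive status query. The exact format is an assumption;
# adjust the regex to what your version actually emits.
SAMPLE = """\
   physicaldrive 1I:1:1 (port 1I:box 1:bay 1, 3 TB): OK
   physicaldrive 1I:1:2 (port 1I:box 1:bay 2, 3 TB): Predictive Failure
   physicaldrive 1I:1:3 (port 1I:box 1:bay 3, 3 TB): OK
"""

# Captures the drive id and the status text after the closing parenthesis.
LINE_RE = re.compile(r"^\s*physicaldrive\s+(\S+)\s+\(.*\):\s*(.+?)\s*$")

def unhealthy_drives(text):
    """Return (drive_id, status) pairs for every drive not reported as OK."""
    bad = []
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match and match.group(2) != "OK":
            bad.append((match.group(1), match.group(2)))
    return bad

print(unhealthy_drives(SAMPLE))
```

Run from cron with the real command output piped in, something like this could send an alert well before the array slows to a crawl.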
-
@scottalanmiller said:
Well, if the drive was having issues, but not ones that would make it fail outright, it might have been marked as healthy while still slowing down. It's possible. Any one slow drive will definitely kill the performance of a RAID array, as all the other drives have to wait on that one for each read or write operation. That it slowed things down isn't surprising; that's not uncommon ahead of an actual failure.
But I have seen on other DL380s that a drive will show an amber LED indicating that it is in a "pre-failure" state, probably because of SMART errors. That didn't happen here, and I have never before seen a RAID so slow and so troubled by a bad disk... I don't have that much experience, though... It will be interesting to hear what HP or our dealer will say about this. I also asked them if they think the RAID card is faulty. The ease of having a controller guiding you is sort of not present anymore; I might as well buy an LSI HBA and go for ZFS. I would surely have been notified about this.
-
@scottalanmiller said:
FreeBSD is a great OS, but if hpacucli is not available for it, then it sucks a bit running it on the bare metal. If you use a hypervisor like ESXi, Hyper-V or XenServer, you can get around that, as the errors go to the hypervisor instead of to a guest OS. Then you can virtualize FreeBSD on top of that without any problems.
Perhaps I should try that? This incident got me thinking that perhaps I have to look closer at my ESX server also, since this machine is a DL385 with the same controller...
-
Interesting. This sounds like a similar issue to one I'm having with a G8 server.
-
I live in Norway and it's getting late here. I will return to this thread tomorrow!
-
@flomer said:
I live in Norway and it's getting late here. I will return to this thread tomorrow!
Well WELCOME...in Norwegian...
-
@flomer said:
But I have seen on other DL380s that a drive will show an amber LED indicating that it is in a "pre-failure" state, probably because of SMART errors. That didn't happen here, and I have never before seen a RAID so slow and so troubled by a bad disk... I don't have that much experience, though... It will be interesting to hear what HP or our dealer will say about this. I also asked them if they think the RAID card is faulty. The ease of having a controller guiding you is sort of not present anymore; I might as well buy an LSI HBA and go for ZFS. I would surely have been notified about this.
Any chance that you are using third-party drives instead of HP drives? Non-HP drives will not fully report to the Smart Array.
-
@flomer said:
Perhaps I should try that? This incident got me thinking that perhaps I have to look closer at my ESX server also, since this machine is a DL385 with the same controller...
In this era I would definitely strongly consider virtualizing even a dedicated storage device. The stability and flexibility are almost always worth it.
-
@thanksaj said:
@flomer said:
I live in Norway and it's getting late here. I will return to this thread tomorrow!
Well WELCOME...in Norwegian...
velkommen
-
Welcome to the MangoLassi Community! Great group here.
-
@Dashrender said:
Interesting. This sounds like a similar issue to one I'm having with a G8 server.
HP drives or third-party? The G8 has even more firmware integration than earlier generations.
-
Welcome. You will find some very smart people around here.
-
@flomer said:
I live in Norway and it's getting late here. I will return to this thread tomorrow!
Welcome to MangoLassi.
Glad you found this community.