My Server Crash Writeup 11-10-2015
-
NOTE: Anything I think I've learned here, discussed here, or needs discussed here, I'll mark with a *ML.
My day started with only a mild ominous touch. It was raining, and my youngest daughter thought she was going to throw up. My wife decided to stay home, so I went in to work. Good thing I did.
The day was going fine until around 3:00PM. I noticed our e-mail server was moving slowly. Tried Remote Desktop, was not responding. Time to go to the local box, possibly the old fashioned hard reboot.
When I got into the server room, I noticed one of the drives on our main (and ONLY) data server was blinking amber. I go from a 1 to a 5 on the 1 to 10 anxiety scale because that kind of stuff always makes me nervous. Anyway, no problem, I have spare drives on the shelf ready to go. I pull out the old drive. No problem. I put in the new drive, no problem. I go to log in to start rebuilding the array, and I notice that the server is rebooting. Hmm, that's odd. I look at the drive. Now TWO of the four are blinking amber. I've now gone to a 10, LOL.
Turns out a second drive failed after I did the hot plug. I now realize my data server array is gone. 25 years of data possibly gone forever. Let's hope our Datto device is as good as advertised.
I started up a hybrid virtualization on our Datto. (Our Datto Alto 2 device cannot virtualize locally, only in the cloud.) Within about 15 minutes of the "event", we had a virtual server up and running on the LAN. The users could go about working as they had been. I made an announcement not to save anything to the server, and began the process of doing a BMR (bare metal restore).
*ML1: I actually have extra non-OEM licenses of Server 2003, so this was actually a legit use of the technology.
*ML2: The reason I said not to save is because the Datto allows you to save, but then you need to do a backup on the virtual device, and then do the BMR from that. Since our device virtualizes in the cloud only, that would not be a great option. All other Datto devices virtualize locally and in the cloud, so that would be more feasible.
The BMR is where the trouble began. We have a brand new server, but I did not want to use that, as that will be the platform for our new Hyper-V VMs. I grabbed a spare desktop we had around that also had an Intel RAID controller in it. I plugged in an SSD, and began the BMR. In my tests, I had some issues with BMR, so just in case, I only restored the boot drive. In those test issues, I was able to fix it with the StorageCraft Recovery Environment. (Datto uses ShadowProtect as its backup program.) But we were not able to fix this particular issue. It booted to a black screen. After a while on the phone with Datto support, I decided to BMR another machine while the tech did some backend work on the Datto box to try another BMR method. I got the second desktop up and running, received a STOP 7B error, and was able to fix it with the StorageCraft recovery CD. But then got another strange error, a C0000135 error. I started Googling this while the Datto tech did another BMR on the Intel RAID machine.
Google told me this error was caused by a recent Windows Update. I was able to boot into the SC recovery environment and manually "uninstall" the KB update (by copying files from the uninstall folder for the KB) that caused the 135 error. With fingers crossed I rebooted the machine, and it came up. I started to restore the data drive image.
This took about 75 minutes for 100GB. Rebooted again, and everything was exactly how it was at 2:59.
So it took about 12 hours, but I had the server back up as it was. About 9 of those hours were spent getting the BMR to work. I think most of the wasted time was due to a driver issue with the Intel RAID card.
*ML3: There has been a lot of discussion about BMR and why it's not always a great idea. I'm not sure how I could have made this better. I considered in the future having a machine I knew I could BMR to, but I'm not sure if the image itself (and the files it has loaded) makes a difference in what to BMR to.
The server is now running on a Dell desktop with a single SSD, but it's up. Considering the age of the server, this is honestly probably a better solution! The data on this machine will be moved to the new server once that is up and running (whenever Server 2016 comes out).
The main thing I took away from this is ... working backups are so, so important. I also understand why virtualizing is so awesome ... no need to worry about these hardware issues.
-
75min from 100GB? Sounds like your disk arrays are a bit slow.
-
@Jason said:
75min from 100GB? Sounds like your disk arrays are a bit slow.
That's over the LAN. Might have been a bit less time and a bit more GB.
-
I had to do the math... the best expectation would be around 17 minutes for 100 GB, so yeah, 75 seems a bit long. But that depends on whatever else you have happening on either of the disk arrays.
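For anyone who wants to redo that math with their own numbers, here is a rough sketch in Python. It assumes decimal units (1 GB = 1000 MB) and that the 17 minute figure comes from a sustained rate of about 100 MB/s, which is roughly what a gigabit link manages in practice. That rate is my assumption, not something stated in the thread, and the function names are just made up for illustration.

```python
def transfer_minutes(size_gb: float, throughput_mb_s: float) -> float:
    """Minutes needed to move size_gb at a sustained throughput_mb_s (1 GB = 1000 MB)."""
    return size_gb * 1000 / throughput_mb_s / 60


def effective_mb_s(size_gb: float, minutes: float) -> float:
    """Average throughput in MB/s implied by moving size_gb in the given minutes."""
    return size_gb * 1000 / (minutes * 60)


if __name__ == "__main__":
    # Best case: 100 GB at an assumed 100 MB/s sustained rate.
    print(f"{transfer_minutes(100, 100):.1f} min")   # 16.7 min
    # What the reported restore works out to: 100 GB in 75 minutes.
    print(f"{effective_mb_s(100, 75):.1f} MB/s")     # 22.2 MB/s
```

So 75 minutes for 100 GB works out to a bit over 20 MB/s on average, well under what the link could carry in theory. Restore software overhead, the source device, and the target disk can all pull the real number down.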
-
Sorry it was 126GB.
And it might have only been 60 minutes. I didn't time it because I hightailed out of there to get to a Wawa for a hoagie. Then a second Wawa when the first was closed for some inexplicable reason.
It's not an array. It's from the Datto to the new server, in the BMR environment.
-
BTW: things are ridiculously fast on this new temporary machine.
It's just an i3 or an i5, but I think the SSD is what is making it fast.
Server 2003 never had it so good!
-
@BRRABill said:
*ML3: There has been a lot of discussion about BMR and why it's not always a great idea.
Those discussions have been 100% about why imaging as a process is not generally appropriate for workstations. In every case it has been pointed out that for servers it is standard and generally very good to be able to do. No amount of it being a poor technology choice for desktops would imply that it is bad for servers.
-
@BRRABill said:
It's just an i3 or an i5, but I think the SSD is what is making it fast.
Storage is normally nearly the entire bottleneck in SMB systems.
-
@scottalanmiller said:
@BRRABill said:
*ML3: There has been a lot of discussion about BMR and why it's not always a great idea.
Those discussions have been 100% about why imaging as a process is not generally appropriate for workstations. In every case it has been pointed out that for servers it is standard and generally very good to be able to do. No amount of it being a poor technology choice for desktops would imply that it is bad for servers.
Do you mean that restoring bare metal servers with system images is not generally appropriate?
(Sorry, sometimes you American English speakers look confusing to me. I think it is slang, not academic, right?)
-
@IT-ADMIN said:
@scottalanmiller said:
@BRRABill said:
*ML3: There has been a lot of discussion about BMR and why it's not always a great idea.
Those discussions have been 100% about why imaging as a process is not generally appropriate for workstations. In every case it has been pointed out that for servers it is standard and generally very good to be able to do. No amount of it being a poor technology choice for desktops would imply that it is bad for servers.
Do you mean that restoring bare metal servers with system images is not generally appropriate?
(Sorry, sometimes you American English speakers look confusing to me. I think it is slang, not academic, right?)
No slang, no sarcasm. Imaging is good for servers and virtualized workstations, imaging is generally bad for physical workstations.
-
Now I understand you, thank you.
And sorry for my poor English. Sometimes I want to make sure whether I understand your idea or not, so I ask a question after an answer.
-
so virtualization is the safest solution for disaster recovery
-
@IT-ADMIN said:
so virtualization is the safest solution for disaster recovery
Every server should be virtualized 100% of the time. That includes the disaster recovery. Any use of a physical server is negative.
(Yes there are super rare exceptions, no they don't apply to anyone wondering if it applies to them.)
-
@scottalanmiller Unless it applies, then it applies. If it doesn't, it won't. OK?
-
@scottalanmiller said:
Those discussions have been 100% about why imaging as a process is not generally appropriate for workstations. In every case it has been pointed out that for servers it is standard and generally very good to be able to do. No amount of it being a poor technology choice for desktops would imply that it is bad for servers.
I meant that in that sometimes it is a chore to get the BMR working, and sometimes it doesn't at all.
The "store the data separately" concept.
Obviously that wouldn't work for database servers, etc.
-
@scottalanmiller said:
Every server should be virtualized 100% of the time. That includes the disaster recovery. Any use of a physical server is negative.
For DR ... how do you back up the VM? The actual machine itself? Or the VHDX file?
-
@BRRABill said:
@scottalanmiller said:
Every server should be virtualized 100% of the time. That includes the disaster recovery. Any use of a physical server is negative.
For DR ... how do you back up the VM? The actual machine itself? Or the VHDX file?
You use something like Veeam to back up your VMs to whatever backup storage you have.
Veeam takes care of the whole thing, you don't worry about the VHDX.
-
@Dashrender said:
You use something like Veeam to back up your VMs to whatever backup storage you have.
Veeam takes care of the whole thing, you don't worry about the VHDX.
My question, I guess, is....
If I had the VHDX, then hardware would be totally out of the equation. I would think if I back up the VM itself as a server, it needs to be restored as such.
But I will admit to being unfamiliar with the backup of VMs, having only physical servers at the current juncture. (Soon to change though, hence the questions.)
-
@BRRABill said:
For DR ... how do you back up the VM? The actual machine itself? Or the VHDX file?
Generally no. You can use snapshots, but most of them also support using a client, which is far more powerful. You basically get a SysPrep'ed image that boots to WinPE like this. Meaning theoretically you could switch from Hyper-V to VMware if you needed to (or vice versa) on a backup (or even to a physical box if the whole virtual system went crazy). With snapshots you're locked in.