XenServer - Crash post mortem



  • My XenServer crashed last night during a file copy process.

    I had shut down all of my VMs, connected with WinSCP, and was copying a 700 GB VHD to a USB drive attached to my PC.

    It went well until 556 GB and then XenServer crashed.

    The system sat in a boot loop until I got in this morning because it wouldn't boot from the SD card where XS was installed.

    I powered off/on the server and it still wouldn't boot from the SD.

    I powered down the server, removed the SD card and ran Spinrite on it. But that was so slow I gave up after about 10 mins. It was less than 0.1% done.

    But I know that often the boot sector alone gets messed up on things like this, so I put the SD card back into the server and tried booting again.

    This time it came right up and is working.

    All I can say is WTF?

    The SD card is a 32 GB Sandisk.

    Suggestions on what I should do as a backup? This server (HP DL 380p Gen8) does not have dual SD slots on the MB that I've noticed.



  • You image the SD card, make another identical SD, test it and keep it taped to the back of the server.



  • @Dashrender Add a USB drive and use dd to copy the whole SD card (root file system included) to it, e.g.:

    dd if=/dev/sda of=/dev/sdb bs=512k
    

    Make sure if= is the SD card and of= is the USB stick; reversing them overwrites the SD card. Try booting the USB stick on another computer to test.
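
A rough sketch of how the copy could be checked before trusting it, assuming a shell with dd, head, wc and sha256sum available. The function name clone_and_verify and the device paths in the usage note are placeholders, not something from the posts above. It hashes only as many bytes of the target as the source holds, since the USB stick will usually be larger than the SD card.

```shell
#!/bin/sh
# clone_and_verify SRC DST
# Copies SRC to DST with dd, then compares checksums.
# Hypothetical helper: SRC/DST may be block devices or plain image files.
clone_and_verify() {
    src="$1"
    dst="$2"

    # Raw copy; conv=fsync makes dd flush the data to DST before it exits.
    dd if="$src" of="$dst" bs=512k conv=fsync 2>/dev/null || return 1

    # Number of bytes in the source (reads the whole source, so it is slow).
    size=$(wc -c < "$src" | tr -d ' ')

    # Hash the source, and the same number of bytes from the target.
    src_sum=$(sha256sum < "$src" | cut -d' ' -f1)
    dst_sum=$(head -c "$size" "$dst" | sha256sum | cut -d' ' -f1)

    [ "$src_sum" = "$dst_sum" ]
}
```

Usage would look something like `clone_and_verify /dev/sda /dev/sdb && echo "copy verified"`, run as root. If the source device is still in use, the hashes can differ simply because the source changed between the copy and the check.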



  • @scottalanmiller said:

    You image the SD card, make another identical SD, test it and keep it taped to the back of the server.

    I'm assuming I have to power down the server to do that?

    Can I only make an image from within a *nix system? Should I use a dd command to create the image?



  • @Dashrender said:

    @scottalanmiller said:

    You image the SD card, make another identical SD, test it and keep it taped to the back of the server.

    I'm assuming I have to power down the server to do that?

    Can I only make an image from within a *nix system? Should I use a dd command to create the image?

    Yes, you have to power down to image your boot device 🙂

    No, you can do it elsewhere, but the amount of work will be much higher. This is one of the many "trivial on UNIX, crazy hard on Windows" things.



  • When I tried to look at the SD card inside Windows - Windows kept offering to format the drive for me. Of course I told it no - then booted from Spinrite to attempt to fix it.

    I didn't boot to Windows again to see if I could read the drive after Spinrite ran for a few mins, but I didn't see the point. It's likely formatted with a filesystem Windows can't read, so it would probably just continue to offer to format it.

    FYI - Windows did see two partitions on the SD card. They were both 4 GB, and the rest of the space was just left blank.



  • So continuing with Post Mortem talk - What should I look at to see if I can figure out why it crashed in the first place?



  • @Dashrender said:

    When I tried to look at the SD card inside Windows - Windows kept offering to format the drive for me. Of course I told it no - then booted from Spinrite to attempt to fix it.

    I didn't boot to Windows again to see if I could read the drive after Spinrite ran for a few mins, but I didn't see the point. It's likely formatted with a filesystem Windows can't read, so it would probably just continue to offer to format it.

    FYI - Windows did see two partitions on the SD card. They were both 4 GB, and the rest of the space was just left blank.

    Yes, Windows is not a useful tool here.



  • @Dashrender said:

    So continuing with Post Mortem talk - What should I look at to see if I can figure out why it crashed in the first place?

    It just sounds like either the SD card or the SD reader had an issue. It was probably that simple.



  • If it didn't boot, it would never have gotten to the point of logging.



  • For Windows I use ImageUSB from PassMark




  • @scottalanmiller said:

    If it didn't boot, it would never have gotten to the point of logging.

    When Windows crashes, it still has the ability to write things back to the disk during the crash process - does *nix not do the same?

    Assuming the boot did work, are you suggesting that the system would retain in memory some information about the crash that could then have been written to the logs after rebooting? That would defy everything I know about rebooting, so I'm sure I'm just misunderstanding you.



  • @Dashrender said:

    When Windows crashes, it still has the ability to write things back to the disk during the crash process - does *nix not do the same?

    But Linux didn't crash here, right? It didn't boot up at all?



  • @JaredBusch said:

    For Windows I use ImageUSB from PassMark

    And I use Winimage.

    Will ImageUSB read things that Windows itself can't?



  • @Dashrender said:

    Assuming the boot did work, are you suggesting that the system would retain in memory some information about the crash that could then have been written to the logs after rebooting? That would defy everything I know about rebooting, so I'm sure I'm just misunderstanding you.

    I thought that we knew that the issue was that the storage disconnected. So, Windows, Linux or otherwise, here are the issues with logging:

    • The device to which to log isn't writeable, so no OS capability will fix that. There is nowhere to log (this is why logs should always go to an external collector like ELK, ELG, Logg.ly, etc.).
    • When the system was having issues, it was unable to boot into Xen or Linux in any way, so the logging mechanisms would not be there anyway.
    • The issue you are dealing with is with the hardware, not with the software, so software logging doesn't sound very important here.
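
For the external-collector point, one low-tech option is plain syslog forwarding. This is only a sketch: it assumes the host runs rsyslog (that depends on the XenServer version, so check first), and loghost.example.com, port 514 and the helper name forward_rule are made up for illustration. The function just prints a forwarding rule in rsyslog's classic syntax, where @@ means TCP and a single @ means UDP.

```shell
#!/bin/sh
# forward_rule HOST [PORT] [PROTO]
# Prints an rsyslog forwarding rule that sends every message to HOST.
# Hypothetical helper; PROTO is "tcp" (default) or "udp".
forward_rule() {
    host="$1"
    port="${2:-514}"
    proto="${3:-tcp}"

    case "$proto" in
        tcp) prefix='@@' ;;   # rsyslog: @@ forwards over TCP
        udp) prefix='@'  ;;   # rsyslog: @ forwards over UDP
        *)   echo "unknown protocol: $proto" >&2; return 1 ;;
    esac

    # *.* selects all facilities and all priorities.
    printf '*.* %s%s:%s\n' "$prefix" "$host" "$port"
}
```

Calling `forward_rule loghost.example.com` prints a single rule line. That line could go into an rsyslog drop-in file, followed by a restart of the syslog service, so that messages leave the box before the local storage goes away.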


  • @scottalanmiller said:

    @Dashrender said:

    When Windows crashes, it still has the ability to write things back to the disk during the crash process - does *nix not do the same?

    But Linux didn't crash here, right? It didn't boot up at all?

    XS crashed - well, at least I'm assuming it did. I left work, the server was running, XS was running, and a copy process over the network was happening.

    When I arrived in the morning, the server was stuck in a boot loop - most likely because it couldn't read the SD card.

    The question is, why did the server reboot - I'm assuming because XS crashed and auto restarted.



  • @Dashrender said:

    @JaredBusch said:

    For Windows I use ImageUSB from PassMark

    And I use Winimage.

    Will ImageUSB read things that Windows itself can't?

    By definition, an image reads the drive, not the filesystem. Windows' ability to mount the filesystem is unrelated. Imaging is imaging; there aren't Windows-specific imaging capabilities.
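
To make the drive-versus-filesystem distinction concrete: a raw read never asks the OS to understand the filesystem. The sketch below reads the last two bytes of the first 512-byte sector and checks for the 0x55 0xAA boot signature that MBR disks (and the protective MBR on GPT disks) carry. The function name has_boot_signature is invented for this example, and whether this particular SD card uses that layout is an assumption.

```shell
#!/bin/sh
# has_boot_signature DEVICE_OR_IMAGE
# Succeeds if bytes 510-511 of the first sector are 0x55 0xAA.
# Hypothetical helper; works on a block device or an image file.
has_boot_signature() {
    # Read 2 bytes at offset 510 straight from the device, no mounting involved.
    sig=$(dd if="$1" bs=1 skip=510 count=2 2>/dev/null | od -An -tx1 | tr -d ' \n')
    [ "$sig" = "55aa" ]
}
```

Something like `has_boot_signature /dev/sdb && echo "boot signature present"` would work on a card Windows refuses to mount, because nothing here depends on the filesystem being readable. A missing signature would be one hint that the first sector is damaged, though it proves nothing about the rest of the card.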



  • @scottalanmiller said:

    @Dashrender said:

    @JaredBusch said:

    For Windows I use ImageUSB from PassMark

    And I use Winimage.

    Will ImageUSB read things that Windows itself can't?

    By definition, an image reads the drive, not the filesystem. Windows' ability to mount the filesystem is unrelated. Imaging is imaging; there aren't Windows-specific imaging capabilities.

    Yeah, I assumed as much - but needed to make sure.



  • @Dashrender said:

    XS crashed - well, at least I'm assuming it did. I left work, the server was running, XS was running, and a copy process over the network was happening.

    Sure, we would assume this would happen if the storage failed. What we know after the crash is that the storage was having issues at the hardware level. Is it possible that XS crashed from a software error and then, by total coincidence, the hardware failed at the same time, in such a way that it would have caused the original crash but didn't? Sure. But did it really? No, you know what the issue is here with 99.9% certainty.



  • @Dashrender said:

    The question is, why did the server reboot - I'm assuming because XS crashed and auto restarted

    ...because the filesystem became unavailable, leaving nowhere to write logs.



  • On a somewhat unrelated note, thanks for getting me to look at the /var/log directory on the XenServer here. Had a bunch of old logs clogging things up (500MB on the tiny little root partition.) Moved those off to my workstation for now, probably need to take some time and see what's going on.
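
For anyone wanting to do the same cleanup check, here is a small sketch that lists the biggest files under a directory. The helper name largest_files is made up, and /var/log is simply the directory mentioned above; sizes are in KiB as reported by du -k.

```shell
#!/bin/sh
# largest_files DIR [COUNT]
# Prints the COUNT largest regular files under DIR (default 10), biggest first.
# Output format is du's: size in KiB, a tab, then the path.
largest_files() {
    dir="$1"
    count="${2:-10}"
    find "$dir" -type f -exec du -k {} + 2>/dev/null | sort -rn | head -n "$count"
}
```

Running `largest_files /var/log 5` alongside `df -h /` gives a quick picture of what is eating the root partition before anything gets moved or deleted.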



  • @scottalanmiller said:

    @Dashrender said:

    When Windows crashes, it still has the ability to write things back to the disk during the crash process - does *nix not do the same?

    But Linux didn't crash here, right? It didn't boot up at all?

    I guess I mistook this to be you saying that Linux didn't crash - it did, but not because of software, but because of hardware, so it's not Linux's fault, it's hardware's fault.

    OK I gotcha! 😉

    So now I need to get another SD card and create images of it.



  • @travisdh1 said:

    On a somewhat unrelated note, thanks for getting me to look at the /var/log directory on the XenServer here. Had a bunch of old logs clogging things up (500MB on the tiny little root partition.) Moved those off to my workstation for now, probably need to take some time and see what's going on.

    Glad my freak out could help ya out 🙂

