Why HALizard and XenServer Failed so heavily

DustinB3403

So, to explain as simply as possible what happened:

The boot drive on host1 died, but the host continued to run. XAPI eventually failed on that host, which left host2 unable to seize the primary role from host1.

The failure of the boot device (USB), combined with the continued operation of XS on host1, was compounded by the fact that XAPI was hung on both servers.

After checking with the HALizard guys, we tried to move the primary role over to host2, but couldn't. When we rebooted host1 and it failed to start up, that host became an iSCSI target for the pool, only hosting the VMs.

Host2 didn't see this and use it on its own; we needed HALizard support (Sal was a great help) to get to the VMs. Unfortunately, the VMs became corrupted during the failure and failed to boot (BSOD or "fix me" screens).

So it was full on recovery from that point.

Dashrender

This doesn't bode well for HA Lizard or DRBD.... Granted, we don't have enough information, and might never have, to blame one or the other.

Considering that it's the industry standard to run VM hosts from USB sticks/SD cards, this type of failure, where the boot drive dies but the host keeps on working, seems almost common. It happened to me.

So, are you just not supposed to use HA Lizard and DRBD with USB/SD boot drives?

scottalanmiller

@Dashrender said in Why HALizard and XenServer Failed so heavily:

This doesn't bode well for HA Lizard or DRBD....

What makes you mention DRBD? No indication of any DRBD issue.

Dashrender

@scottalanmiller said in Why HALizard and XenServer Failed so heavily:

@Dashrender said in Why HALizard and XenServer Failed so heavily:

This doesn't bode well for HA Lizard or DRBD....

What makes you mention DRBD? No indication of any DRBD issue.

Ignorance.

Doesn't HA Lizard work on top of DRBD for the shared storage?
If not then I completely don't understand how it works.

scottalanmiller

@Dashrender said in Why HALizard and XenServer Failed so heavily:

@scottalanmiller said in Why HALizard and XenServer Failed so heavily:

@Dashrender said in Why HALizard and XenServer Failed so heavily:

This doesn't bode well for HA Lizard or DRBD....

What makes you mention DRBD? No indication of any DRBD issue.

Ignorance.

Doesn't HA Lizard work on top of DRBD for the shared storage?
If not then I completely don't understand how it works.

Yes, it does, and there is no indication of a storage problem. So why mention storage when there was no storage issue (that we know of)?

scottalanmiller

If this had been a SAN, for example, the same issue would have arisen but we wouldn't blame the SAN, right?

Dashrender

Well he said his VMs were corrupt, but we don't know why.

Again, I wasn't trying to blame any specific piece for the problem (I specifically left room open). Instead, I was listing what I knew of the possible culprits.
If you feel that DRBD is completely blameless I'd love to hear why to add to my understanding of how the system is supposed to work.

DustinB3403

@scottalanmiller said in Why HALizard and XenServer Failed so heavily:

@Dashrender said in Why HALizard and XenServer Failed so heavily:

This doesn't bode well for HA Lizard or DRBD....

What makes you mention DRBD? No indication of any DRBD issue.

In fact, DRBD was running and had no issues.

This is why host1 was acting as iSCSI storage for host2 to run the VMs.

That bond was working without issue. The system failed as a soft failure: the hardware was still functional, and the VMs were still functional.

XAPI broke, along with the boot device, so we lost migration functionality, along with backup functionality.

scottalanmiller

@Dashrender said in Why HALizard and XenServer Failed so heavily:

Well he said his VMs were corrupt, but we don't know why.

No, we don't know why, but there is zero reason to suspect DRBD. Nothing points to or suggests DRBD. Yet we know that they were forced off, which is likely to corrupt them, so there is a high chance of that and the indicators point to it. That DRBD replicated the corruption is DRBD doing its job correctly.

So while there is no proof that DRBD wasn't involved, there isn't anything pointing to it.

scottalanmiller

@Dashrender said in Why HALizard and XenServer Failed so heavily:

If you feel that DRBD is completely blameless I'd love to hear why to add to my understanding of how the system is supposed to work.

Think of DRBD as RAID 1. If you get corruption on your RAID 1 volume, do you suspect that the RAID system corrupted the data or that corrupted data was written to the array? And yet it is possible that the RAID itself was the issue, sure. But it's not likely. There are places where we would expect it to happen, and a scenario to cause that to happen was there. If DRBD was going to corrupt things, likely it would have done it while the system was running, not at that exact moment. It's way too much of a coincidence.
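To make the RAID 1 analogy concrete, here is a toy Python sketch. It is not DRBD's code or API; `MirroredVolume` is a hypothetical class invented for illustration, and it only models the idea that a synchronous mirror copies whatever is written, good or bad, to both sides.

```python
class MirroredVolume:
    """Toy synchronous mirror (hypothetical, for illustration only)."""

    def __init__(self, num_blocks):
        self.primary = [b""] * num_blocks
        self.secondary = [b""] * num_blocks

    def write(self, block, data):
        # The mirror does not judge the data; it stores the same bytes twice.
        self.primary[block] = data
        self.secondary[block] = data

    def in_sync(self):
        return self.primary == self.secondary


def demo():
    vol = MirroredVolume(4)
    vol.write(0, b"good filesystem metadata")
    # Assumption: a VM that is forced off mid-write leaves a torn block.
    vol.write(0, b"good files")
    return vol.in_sync(), vol.secondary[0]
```

In this model `demo()` reports the replicas as in sync while both hold the truncated block, which is the point above: the mirror worked correctly, and the bad data came from whatever wrote it.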

scottalanmiller

What if you had just pulled out the cables so that it looked like a power failure? Wouldn't that have fixed things, except for the corrupted VMs? Might have protected against that too, but that's just random.

Dashrender

Dustin, how did you discover the failure?

Did things just stop working? I.e., did your VMs go read-only or just stop responding?

dafyre

Stories like this are why I'll happily continue to run my hypervisor on spinning rust / RAID 1.

Yeah, sure, it could happen to HDDs too, but that's why we have RAID 1. If he'd been able to come up with a way to RAID 1 two USB drives together, it might not have been an issue for him either.

(Can you not use MDRAID on two USB sticks?)

scottalanmiller

@Dashrender said in Why HALizard and XenServer Failed so heavily:

Dustin, how did you discover the failure?

Did things just stop working? I.e., did your VMs go read-only or just stop responding?

If they went read only, that would cause us to look at DRBD. That's the kind of thing that a DRBD failure would look like.

scottalanmiller

@dafyre said in Why HALizard and XenServer Failed so heavily:

(Can you not use MDRAID on two USB sticks?)

Yup

DustinB3403

@scottalanmiller said in Why HALizard and XenServer Failed so heavily:

What if you had just pulled out the cables so that it looked like a power failure. Wouldn't that have fixed things, except for the corrupted VMs? Might have protected against that too, but that's just random.

We didn't know at the time that the boot drive was dead on host1. So we weren't certain of the status of the cluster.

We only knew that both systems had XAPI hung (the VMs were running fine until we touched the cluster).
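For anyone wanting to catch a silently dead boot device sooner, one possible approach is a periodic write probe against a directory on the boot device. This is only a sketch under assumptions: `boot_device_writable` is a hypothetical helper, the directory to probe is up to you, and a read-back can be served from the page cache, so a passing probe is not proof that the device is healthy.

```python
import os
import tempfile


def boot_device_writable(path):
    """Return True if a small file can be written, fsynced and read back under path."""
    payload = b"probe"
    try:
        fd, name = tempfile.mkstemp(dir=path)
        try:
            os.write(fd, payload)
            os.fsync(fd)  # force the write down to the device
            os.lseek(fd, 0, os.SEEK_SET)
            return os.read(fd, len(payload)) == payload
        finally:
            os.close(fd)
            os.unlink(name)  # always clean up the probe file
    except OSError:
        # Missing path, read-only remount, I/O error, and so on.
        return False
```

Run from cron or a monitoring agent, a `False` result would at least have told us the boot drive on host1 was gone before we started touching the cluster.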

scottalanmiller

@DustinB3403 said in Why HALizard and XenServer Failed so heavily:

@scottalanmiller said in Why HALizard and XenServer Failed so heavily:

What if you had just pulled out the cables so that it looked like a power failure. Wouldn't that have fixed things, except for the corrupted VMs? Might have protected against that too, but that's just random.

We didn't know at the time that the boot drive was dead on host1. So we weren't certain of the status of the cluster.

We only knew that both systems had XAPI hung (the VMs were running fine until we touched the cluster).

That's one of the risks of clusters: so much complexity. With a single host, you could have dropped to Xen directly and shut things down.

DustinB3403

@scottalanmiller Which is why we're going with the standalone servers using XO's continuous replication.

Reid Cooper

Wow, that's crazy. Glad that it recovered.

DustinB3403

@Reid-Cooper said in Why HALizard and XenServer Failed so heavily:

Wow, that's crazy. Glad that you had a recovery solution planned out, great job!

I FTFY.