ESXi recovery woes

Carnival Boy

I have four ESXi hosts: two running ESXi version 5.1 (HP ProLiant DL380 G6s) and two running version 5.5 (DL380 Gen9s). I have a VM running Windows Server 2008 R2 and a document management system called Meridian that uses a proprietary database called Hypertrieve.

I take an online backup of the VM.

If I restore the backup to any of the hosts, the VM appears to boot fine and will try to recover the database. The recovery works fine on the two hosts running version 5.1, but fails on the two hosts running 5.5.

I’ve compared the Event Logs of both machines. They start off similar, with events like:
“Start Restore Database (0)”
“Start copying snapshot back (0)”
“End copying snapshot back (0)”
“Restore: database opened (0)”
“Restore : start truncate log (0)”
“Restore: end truncate log (0 bytes discarded) (0)”

But then, the good host logs this:
“Restore: end processing log (0)”
“Restore: complete”

But the bad host logs this:
“Error: File locked (133) OurDatabase.HDB”
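The comparison above can be sketched as a small script that finds the first point where two event sequences diverge. This is only a sketch: the event strings are transcribed from this post, and the `first_divergence` helper is a made-up name. Real logs would first have to be exported from the Event Viewer on each restored VM.

```python
from itertools import zip_longest

def first_divergence(good, bad):
    """Return (index, good_event, bad_event) at the first differing position, or None."""
    for i, (g, b) in enumerate(zip_longest(good, bad)):
        if g != b:
            return i, g, b
    return None

# Events both restored VMs log, as quoted in the post.
common = [
    "Start Restore Database (0)",
    "Start copying snapshot back (0)",
    "End copying snapshot back (0)",
    "Restore: database opened (0)",
    "Restore : start truncate log (0)",
    "Restore: end truncate log (0 bytes discarded) (0)",
]
good_host = common + ["Restore: end processing log (0)", "Restore: complete"]
bad_host = common + ["Error: File locked (133) OurDatabase.HDB"]

divergence = first_divergence(good_host, bad_host)
print(divergence)
```

On these two sequences the divergence is the step right after the log truncation, where the good host finishes processing the log and the bad host reports the file as locked.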

I appreciate none of you will know the specifics of how a Hypertrieve database restores itself from an unclean shutdown, but I was wondering if any of you had a clue as to why there is inconsistency depending on where the VM is restored to. I'm really at a loss as to where to look. I'd always previously assumed that if the server will boot it will work the same regardless of which version of ESXi it is or which host it is on as the VM should operate transparently to the hypervisor.

The software vendor simply says they don't support hypervisor issues so I'm kind of on my own on this (nice).

scottalanmiller

So the real answer is that I have no idea. But, just guessing: if this is happening repeatably and reliably, the ESXi 5.5 snapping mechanism may be just different enough that it causes the database to see a different type of corruption. In both cases you are only crash consistent; that's common and expected. As for why they recover differently, I'm guessing that the imaging agent changed, but I don't know how.

scottalanmiller

@Carnival-Boy said in ESXi recovery woes:

The software vendor simply says they don't support hypervisor issues so I'm kind of on my own on this (nice).

No problem, explain to him that this is a storage issue and has nothing to do with the hypervisor. The identical thing would happen if you were doing SAN snaps of a running VM.

Dashrender

@scottalanmiller said in ESXi recovery woes:

So the real answer is that I have no idea. But, just guessing: if this is happening repeatably and reliably, the ESXi 5.5 snapping mechanism may be just different enough that it causes the database to see a different type of corruption. In both cases you are only crash consistent; that's common and expected. As for why they recover differently, I'm guessing that the imaging agent changed, but I don't know how.

Wait a second. Are the snaps being taken on 5.1 only, or are they being taken on both?

I guess the way I read it was that the backups (what I assume Scott means by a snap) were taken on a 5.1 machine.

Scott, are you saying that the data backed up from 5.1 would somehow be different when restored on 5.5? Even if you aren't saying that, why would the way snaps are taken make any difference? The data is taken on a 5.1 server, backed up to some media, then restored from that media onto a 5.5 server - why would the snapping tech be involved here?

scottalanmiller

Sorry, I was thinking that they were being snapped on 5.1 AND 5.5 and restored to what they were snapped from.

coliver

Really the question is how does the database lock? It sounds like there is some means of locking the database that is specific to a VM. Does the MAC address change when you try and restore to ESXi 5.5?

scottalanmiller

@coliver said in ESXi recovery woes:

Really the question is how does the database lock? It sounds like there is some means of locking the database that is specific to a VM. Does the MAC address change when you try and restore to ESXi 5.5?

Nothing would be specific to a VM. The virtual nature here is a red herring. This is purely about storage, at least as a root cause for the corruption. Why it sees the resulting storage differently in the two cases, that's likely virtualization related. But the database corruption in the first place is caused purely by the storage.

Dashrender

How is the storage different? Not that I disagree, just looking for more information.

scottalanmiller

@Dashrender said in ESXi recovery woes:

How is the storage different? Not that I disagree, just looking for more information.

That we don't know. What we know is that snapping will cause corruption with a database. So the corruption is expected and universal. What we don't know is why the snaps are loading one way in one version and another in another. I can only imagine that the block driver was changed between the two and something additionally is being affected.

Carnival Boy

I'm not sure what you mean by snapping? Do you mean VSS?

The issue happens if I take a backup on a 5.5 host and try and restore it on a 5.5 host. If I do that it will fail. But I can restore that same 5.5 host backup to a 5.1 host and it works fine. So the source of the backup doesn't seem to be an issue as much as the destination.
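The reasoning above can be checked mechanically by tabulating each backup/restore combination and asking which factor alone predicts the outcome. This is a sketch under one assumption: that the first backup in this thread was taken on a 5.1 host. The `predicts_outcome` helper is a made-up name.

```python
# (backup source version, restore destination version) -> did the database recover?
results = {
    ("5.1", "5.1"): True,
    ("5.1", "5.5"): False,
    ("5.5", "5.1"): True,
    ("5.5", "5.5"): False,
}

def predicts_outcome(results, position):
    """True if the version at `position` (0 = source, 1 = destination) alone fixes the outcome."""
    seen = {}
    for key, ok in results.items():
        version = key[position]
        # Remember the first outcome seen for this version; any later mismatch disproves it.
        if seen.setdefault(version, ok) != ok:
            return False
    return True

source_decides = predicts_outcome(results, 0)
destination_decides = predicts_outcome(results, 1)
print("source decides:", source_decides)
print("destination decides:", destination_decides)
```

With these four results only the destination version is consistent with the outcome, which matches the conclusion that the restore target matters rather than where the backup was taken.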

scottalanmiller

@Carnival-Boy said in ESXi recovery woes:

I'm not sure what you mean by snapping?

Slang for "taking a snapshot." That's the process that is introducing the initial corruption, I assume. The corruption should come from a block-based snapshot of the running database files.

scottalanmiller

It was the phrase "online backup of the VM" that I took to be a description of a snapshot based backup. Like Veeam would do.

Carnival Boy

I assume so. I have used both Veeam and Unitrends.

Dashrender

@Carnival-Boy said in ESXi recovery woes:

I'm not sure what you mean by snapping? Do you mean VSS?

The issue happens if I take a backup on a 5.5 host and try and restore it on a 5.5 host. If I do that it will fail. But I can restore that same 5.5 host backup to a 5.1 host and it works fine. So the source of the backup doesn't seem to be an issue as much as the destination.

Oh, then I stand corrected: you are taking backups (using snaps) on the 5.5 as well as the 5.1. So you have two of these Hypertrieve servers? One on 5.1 and another on 5.5?

Dashrender

So here's a question - does Hypertrieve have its own backup process for an online DB? Some things do. Before you kick off the backup of the VM, you kick off the backup process on the Hypertrieve DB, then the VM backup happens. Then when you restore, the Hypertrieve stuff will do its own restore (you might have to do it manually). This is all in the name of preventing corruption.
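The ordering described above (application-level backup first, VM backup second) can be sketched as a tiny orchestrator. Everything here is hypothetical: `run_backup` and the two stand-in functions are illustrative only, and the real steps would call whatever backup tool Hypertrieve and the VM backup product actually provide.

```python
def run_backup(app_backup, vm_backup):
    """Run the application-level backup first; only back up the VM if it succeeded."""
    steps = []
    if not app_backup():
        steps.append("app backup failed; VM backup skipped")
        return steps
    steps.append("app backup ok")
    vm_backup()
    steps.append("vm backup ok")
    return steps

# Stand-ins that only record the order in which they were called.
order = []

def fake_app_backup():
    order.append("app")
    return True

def fake_vm_backup():
    order.append("vm")

steps = run_backup(fake_app_backup, fake_vm_backup)
print(order)
print(steps)
```

The point of the ordering is that the VM image then contains a consistent application-level backup alongside the live database files, so a restore can fall back to it.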

Carnival Boy

I shut down the VM on the 5.1 host, migrated it to the 5.5 host, powered it on, took another backup, then restored it back to both the 5.1 and the 5.5 host. I've been pretty busy!

Dashrender

@Carnival-Boy said in ESXi recovery woes:

I shut down the VM on the 5.1 host, migrated it to the 5.5 host, powered it on, took another backup, then restored it back to both the 5.1 and the 5.5 host. I've been pretty busy!

So you can restore a snap taken from a 5.5 on a 5.1, but not back to the original 5.5 it came from... hmmm..

Carnival Boy

I'm still working on this.

Even if I shut down the VM, back it up in a powered off state, restore to a 5.5 host, and power it on, the Hypertrieve service starts and opens the database, which I can successfully browse, then after about ten seconds it crashes and I can no longer browse the database.

Since this is a restore of a powered off VM, it can't be a snapshotting issue.

I have had a reply from the vendor, who writes:
"So, it's clear for us what happened, the virtualization abstraction generates a conflict when you instance a new VM just copying, the Disk Hash is not the same, and crashes the EDM sometimes, I don't recommend a Server Copy, always the backup procedure."

I don't really understand this. Anyone?

By "backup procedure" I think they are talking about taking a Hypertrieve backup via the Hypertrieve software and restoring the database that way after migrating.

Which I'm hoping to try next, but, to compound the issue, Unitrends (which I hate, by the way) has stopped working for me, so I can no longer restore the VM! It's just one thing after another with this - I can feel my life slowly slipping away!

DustinB3403

"So, it's clear for us what happened, the virtualization abstraction generates a conflict when you instance a new VM just copying, the Disk Hash is not the same, and crashes the EDM sometimes, I don't recommend a Server Copy, always the backup procedure."

This means that when you are importing the VM into the other host, it has a new Disk ID which is causing the issue, as the snapshot process creates a custom disk ID.

What they are recommending you do is a full backup, and import that, which should resolve the issue.

Is there no built-in way with the ESXi version to create a full backup? (I'm thinking of XO at this point so don't mind me if I'm completely wrong)

Dashrender

@Carnival-Boy said in ESXi recovery woes:

I have had a reply from the vendor, who writes:

"So, it's clear for us what happened, the virtualization abstraction generates a conflict when you instance a new VM just copying, the Disk Hash is not the same, and crashes the EDM sometimes, I don't recommend a Server Copy, always the backup procedure."

So does ESXi 5.1 somehow maintain the disk hash, and VMware changed this practice in 5.5? Something for you to investigate.

@DustinB3403 said in ESXi recovery woes:

This means that when you are importing the VM into the other host, it has a new Disk ID which is causing the issue, as the snapshot process creates a custom disk ID.

Eh? Actually, the OP proved it has nothing to do with the snapshots by taking a backup while the VM was shut down.

This is a restore to a new VM problem. It's a problem because the vendor has the system checking the Disk ID, presumably for copy protection reasons, yet it is easily thwarted by using a backup and restore procedure of the DB/application software itself. This of course means that restoring a system potentially takes much longer, because not only do you have to restore the VM, but then you have to restore the DB inside the VM - assuming this is even possible, because I suppose you might have to reinstall the application before restoring the DB so that the application recognizes the new disk hash.
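To make the "Disk ID check" idea concrete, here is a minimal sketch of how an application could tie a database to a disk fingerprint. It is purely illustrative: the vendor has not said how Hypertrieve computes its "Disk Hash", and the function names and disk identifiers below are made up.

```python
import hashlib

def disk_fingerprint(disk_id: str) -> str:
    """Hash a disk identifier into a stable fingerprint (illustrative only)."""
    return hashlib.sha256(disk_id.encode("utf-8")).hexdigest()

def database_may_open(stored_fingerprint: str, current_disk_id: str) -> bool:
    """Allow the database to open only if the disk still matches the stored fingerprint."""
    return disk_fingerprint(current_disk_id) == stored_fingerprint

# Made-up identifiers: the original VM's disk and the disk of a restored copy.
original_disk_id = "disk-aaaa-1111"
restored_disk_id = "disk-bbbb-2222"

stored = disk_fingerprint(original_disk_id)  # recorded while running on the original VM
same_vm_ok = database_may_open(stored, original_disk_id)
restored_copy_ok = database_may_open(stored, restored_disk_id)
print(same_vm_ok, restored_copy_ok)
```

If something like this is going on, an application-level restore would work because the database gets re-registered against the new disk. It would not by itself explain why restores to the 5.1 hosts succeed; that would depend on whether the identifier survives a restore there.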