    Failing XenServer hosts are such a PITA

    IT Discussion

    • Coloradogeek

      Lost #3 over the weekend again. I'm removing the host from the pool and leaving it out - the server isn't reliable. However, in the process of doing so, #2's network stack is failing to come up, and even after a reboot of the master, I'm not feeling rosy about this system this morning. I'm sure I'll figure it out, but after coming off the high on Friday (got Commvault working on a critical CentOS system), coming into Monday like this is a kick in the pants. The front side.
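
      For reference, the eject itself is simple on paper. A minimal sketch of removing a member from an XS7 pool with the xe CLI, assuming stock tooling (run on the pool master; the UUID is a placeholder):

          xe host-list                           # note the UUID of the flaky host (#3)
          xe pool-eject host-uuid=<host-3-uuid>  # eject it from the pool; this also wipes its local SRs

      Any VMs still resident on the member have to be shut down or migrated off first, and pool-eject only works while the member is still reachable, which is part of why a flaky host makes this painful.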

    • Coloradogeek

      Can't even do an emergency network reset on #2, and now it's 8am. Fuck.

    • DustinB3403

      Keep us informed.

      Removing pool members has never been an easy task as far as I recall.

    • Coloradogeek

      Rebooted #1, and it comes up fine but doesn't start any VMs. #2 has been rebooted twice now; the NICs refuse to come online after a restart. Just did an emergency network reset, but after hitting enter, the system hangs. Not very happy about DR with XS7 this morning. Pretty pissed, actually.

      #3 is offline and was forcibly removed from the cluster before this mess started.
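
      For context, a rough sketch of the two recovery paths in play here, assuming stock XS7 tooling; the addresses and UUIDs are placeholders, and the xe-reset-networking flags are as I recall them from the XenServer docs:

          # On the broken member (#2), from the local console rather than over the network:
          xe-reset-networking --master=<pool-master-ip> --device=eth0 --mode=dhcp
          # The reset only takes effect after the host reboots.

          # On the master, dropping a member that is already dead or unreachable (#3)
          # without attempting a clean eject:
          xe host-forget uuid=<host-3-uuid>

      host-forget skips the cleanup that pool-eject performs, so a forgotten host generally has to be reinstalled before it can join a pool again.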

    • scottalanmiller

      Were there any issues with Node 2 before forcing 3 out of the cluster?

    • Coloradogeek

      None. #1 and #2 have been rock solid. I introduced #3 some time after them, but it's had issues before, so I knew that if it did it again, I would be removing it.

      Initially I removed it via the GUI by disabling HA, but that got hung up and started this whole mess. XS7 is a gem to set up, but like a vindictive mistress, it gets really twitchy when you try to change something, I'm noticing.

      I finally just yanked the power plugs in desperation, gave #1 and #2 a timeout in the corner, then powered them up again. Guess what? VMs are starting now and the cluster is functional again. Even managed to re-enable HA.

      /facedesk
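
      For what it's worth, the CLI equivalent of that HA dance is short; a minimal sketch assuming standard xe tooling, with the SR UUID as a placeholder for whatever shared storage holds the heartbeat:

          xe pool-ha-disable                                      # turn HA off before touching pool membership
          # ... eject or forget the bad host, fix networking ...
          xe pool-ha-enable heartbeat-sr-uuids=<shared-sr-uuid>   # re-arm HA once the pool is healthy

      Doing it from the CLI at least leaves an error message to read when something hangs, which the GUI tends to swallow.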

    • DustinB3403 @Coloradogeek

      @Coloradogeek I'm curious what caused the issue in the first place. Introducing the 3rd server likely had something to do with it. Are all three servers the same model?

    • Coloradogeek

      I didn't say introducing the #3 server caused this - I just mentioned that after I added #3 to the pool a while back, it had a similar event. The first time it happened, I let it slide, waiting to see if it would do it again. Yesterday at 10am it had the same type of problem (isolated itself and rebooted), so today I decided I'd better pull it so that it wouldn't do that during business hours. At the time it only had two VMs running on it, and they weren't taxing the system at all.

      They are all identical systems - Dell R610s with the same CPUs; the only difference is that #3 has 48 GB of RAM while the other two have 96 GB. I even upgraded all of them to the exact same firmware revisions after testing them and before putting them into production. #3 is just a bad egg. Not sure what the problem is, but it's powered off right now and I'll take a look at it if I get time this week.
