    Failing XenServer hosts are such a PITA

    IT Discussion

    • Coloradogeek

      Lost #3 over the weekend again. I'm removing the host from the pool and leaving it out - the server isn't reliable. However, in the process of doing so, #2's network stack is failing to come up, and even after a reboot of the master, I'm not feeling rosy about this system this morning. I'm sure I'll figure it out, but after coming off the high on Friday (got Commvault working on a critical CentOS system), coming into Monday like this is a kick in the pants. The front side.
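
      For reference, the eject itself is simple on paper. A minimal sketch of removing a member from an XS7 pool with the xe CLI, assuming stock tooling (run on the pool master; the UUID is a placeholder):

          xe host-list                           # note the UUID of the flaky host (#3)
          xe pool-eject host-uuid=<host-3-uuid>  # eject it from the pool; this also wipes its local SRs

      Any VMs still resident on the member have to be shut down or migrated off first, and pool-eject only works while the member is still reachable, which is part of why a flaky host makes this painful.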

    • Coloradogeek

      Can't even do an emergency network reset on #2, and now it's 8am. Fuck.

    • DustinB3403

      Keep us informed.

      Removing pool members has never been an easy task as far as I recall.

    • Coloradogeek

      Rebooted #1, and it comes up fine but doesn't start any VMs. #2 has been rebooted twice now; the NICs refuse to come online after a restart. Just did an emergency network reset, but after hitting enter, the system hangs. Not very happy about DR with XS7 this morning. Pretty pissed, actually.

      #3 is offline and was forcibly removed from the cluster before this mess started.
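
      For context, a rough sketch of the two recovery paths in play here, assuming stock XS7 tooling; the addresses and UUIDs are placeholders, and the xe-reset-networking flags are as I recall them from the XenServer docs:

          # On the broken member (#2), from the local console rather than over the network:
          xe-reset-networking --master=<pool-master-ip> --device=eth0 --mode=dhcp
          # The reset only takes effect after the host reboots.

          # On the master, dropping a member that is already dead or unreachable (#3)
          # without attempting a clean eject:
          xe host-forget uuid=<host-3-uuid>

      host-forget skips the cleanup that pool-eject performs, so a forgotten host generally has to be reinstalled before it can join a pool again.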

    • scottalanmiller

      Were there any issues with Node 2 before forcing 3 out of the cluster?

    • Coloradogeek

      None. #1 and #2 have been rock solid. I introduced #3 some time after them, but it's had issues before, so I knew that if it did it again, I would be removing it.

      Initially I removed it via the GUI by disabling HA, but that got hung up and started this whole mess. XS7 is a gem to set up, but like a vindictive mistress, it gets really twitchy when you try to change something, I'm noticing.

      I finally just yanked the power plugs in desperation, gave #1 and #2 a timeout in the corner, then powered them up again. Guess what? VMs are starting now and the cluster is functional again. Even managed to re-enable HA.

      /facedesk
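
      For what it's worth, the CLI equivalent of that HA dance is short; a minimal sketch assuming standard xe tooling, with the SR UUID as a placeholder for whatever shared storage holds the heartbeat:

          xe pool-ha-disable                                      # turn HA off before touching pool membership
          # ... eject or forget the bad host, fix networking ...
          xe pool-ha-enable heartbeat-sr-uuids=<shared-sr-uuid>   # re-arm HA once the pool is healthy

      Doing it from the CLI at least leaves an error message to read when something hangs, which the GUI tends to swallow.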

    • DustinB3403 @Coloradogeek

      @Coloradogeek I'm curious what caused the issue in the first place. Introducing the 3rd server likely had something to do with it. Are all three servers the same model?

    • Coloradogeek

      I didn't say introducing the #3 server caused this - I just mentioned that after I added #3 to the pool a while back, it had a similar event. The first time it happened, I let it slide, waiting to see if it would do it again. Yesterday at 10am it had the same type of problem (isolated itself and rebooted), so today I decided I'd better pull it so that it wouldn't do that during business hours. At the time it only had two VMs running on it, and they weren't taxing the system at all.

      They are all identical systems - Dell R610s with the same CPUs; the only difference is that #3 has 48 GB of RAM while the other two have 96 GB. I even upgraded all of them to the exact same firmware revisions after testing them and before putting them into production. #3 is just a bad egg. Not sure what the problem is, but it's powered off right now and I'll take a look at it if I get time this week.
