How Reliable Is Your Server


  • Service Provider

    Servers are unreliable things, right? We hear about this all of the time. Everyone is concerned that their server(s) will fail. They fail left and right. It happens all of the time. Servers are fragile and need risk mitigation for nearly all situations. They don't even have internally redundant components, right?

    Wrong. None of this is true. Once upon a time it was true that commodity servers and their corresponding operating systems were highly fragile, but that was the early 1990s, and even then the failure risks were mostly limited to the storage layer, and mostly to those systems where cost-cutting measures reduced reliability far below what was available at the time.

    Enterprise servers have long been highly reliable, going back to the 1970s, and commodity servers entered this world of high reliability by the late 1990s; in the 2000s they moved even closer, especially with the advent of 64-bit computing and full virtualization. Today commodity enterprise servers like those from HPE's ProLiant line and Dell's PowerEdge line are incredibly reliable. When properly designed, built, and maintained, reliability can approach the six nines range! This puts a normal server well into consideration for "high availability" right from the outset today.
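    For a sense of scale, those "nines" translate directly into allowed downtime per year. A quick sketch (the availability figures here are illustrative, not vendor specs):

```python
# Annual downtime implied by an availability level ("number of nines").
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # ~31.56 million seconds

def annual_downtime_seconds(availability: float) -> float:
    """Seconds of allowed downtime per year at a given availability (e.g. 0.999999)."""
    return (1.0 - availability) * SECONDS_PER_YEAR

for label, a in [("three nines (99.9%)", 0.999),
                 ("five nines (99.999%)", 0.99999),
                 ("six nines (99.9999%)", 0.999999)]:
    print(f"{label}: {annual_downtime_seconds(a):,.1f} seconds/year")
```

    Six nines works out to roughly half a minute of downtime per year, which is why a single well-built server can sit in "high availability" territory on its own.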

    Standard servers do this through a couple of techniques. One is simply using very solid, well-engineered components. Parts like processors and motherboards have come a very long way and almost never fail, even after a decade or more of continuous abuse. But some parts will always carry some risk, power supplies and hard drives being among the riskiest. In modern enterprise commodity servers nearly all at-risk components are redundant, field serviceable, and nearly always hot swappable. Hot-swap power supplies, hard drives, fans and more are standard, so practically every component with significant risk can be replaced live, without any downtime, even after it has failed. And others, like NICs, are often redundant as well.

    Even two decades ago it was standard to have hot swappable PCI slots so that support components could be replaced without downtime!

    Of course these are only the commodity servers that we are talking about. Today even AMD64 architecture servers are available in non-commodity forms (minicomputers and better). RAS features (reliability, availability and serviceability) on mini (HPE Integrity, Superdome, Oracle M, IBM Power, Fujitsu SPARC) and mainframe systems are extreme and go far beyond what can be done with commodity servers. Hot-swappable memory, backplanes, CPUs, controllers and even motherboards are standard offerings. Downtime isn't a word that systems like these know, at all.

    Simply put, servers today are not the fragile things that they were twenty or thirty or even forty years ago. Servers are generally rock solid, incredibly reliable devices. The idea that servers will simply die regularly, that they are unreliable and need to be protected from hardware failure in all cases is emotional, irrational, and based on the fears of not just a different era, but a totally different generation entirely.

    Before giving in to fear that your server will stop functioning every few months, take a minute to think... perhaps your servers are more reliable than you give them credit for.


  • Service Provider

    And all of that is before we introduce new software technologies like virtualization that allow us to build server clusters. There are many software techniques available on commodity servers today that, two decades ago and longer, existed only on minicomputers and better. Good hardware, good software, and sound system design and maintenance in combination are what produce highly reliable systems. Modern approaches in all of these areas are providing reliability and uptimes that were unthinkable just two decades ago.



  • I am more afraid of the software ON the server than the server itself.

    Oh, and recently, USB boot devices. 🙂



  • @BRRABill said :

    Oh, and recently, USB boot devices. 🙂

    That's why you always have two USB devices 😉


  • Service Provider

    @BRRABill said in How Reliable Is Your Server:

    I am more afraid of the software ON the server than the server itself.

    Oh, and recently, USB boot devices. 🙂

    You don't protect that in the same way. You just restore from backup.



  • @scottalanmiller said in How Reliable Is Your Server:

    @BRRABill said in How Reliable Is Your Server:

    I am more afraid of the software ON the server than the server itself.

    Oh, and recently, USB boot devices. 🙂

    You don't protect that in the same way. You just restore from backup.

    How many spare bootable USB clones would you recommend for most SMBs?

    It might seem like an odd question but I'm sure someone is asking it.

    I keep 1 spare at all times for my USB booted systems.



  • Humans are the weakness in the chain, have been for a long time now.

    Chances are exceptionally good that if you are having issues, PEBKAC.



  • @DustinB3403 Just order 3 more of these from Amazon for $12.49 each (current Deal of the Day).



  • Just seems like a strange thing, to have to keep spare USBs around because they fail.

    I mean, yeah they are cheaper, but what is the time cost to keep making backups, keep buying USBs, manage the backups and USBs, etc...

    You could setup a small 2 disk array and accomplish the same thing. Probably easier (at least on XS) to just backup the config.

    What is the point again? Or am I, like usual, missing it?



  • @BRRABill said in How Reliable Is Your Server:

    Just seems like a strange thing, to have to keep spare USBs around because they fail.

    I mean, yeah they are cheaper, but what is the time cost to keep making backups, keep buying USBs, manage the backups and USBs, etc...

    You could setup a small 2 disk array and accomplish the same thing. Probably easier (at least on XS) to just backup the config.

    What is the point again? Or am I, like usual, missing it?

    They don't fail that often is what you're missing.

    The time to clone a USB is minutes a month (or every few months).

    The time to restore a config in XS would be hours, at the point in time it crashes. If not longer. Plus you have no recent backup to work from.



  • @DustinB3403 said

    They don't fail that often is what you're missing.

    The time to clone a USB is minutes a month (or every few months).

    The time to restore a config in XS would be hours, at the point in time it crashes. If not longer. Plus you have no recent backup to work from.

    Understood.

    Do they really not fail that much? We've seen a few on ML just this month.

    Coincidence, maybe.

    I wonder if you let the logs write to the USB stick, would it really die quickly, anyway?



  • @BRRABill said in How Reliable Is Your Server:

    @DustinB3403 said

    They don't fail that often is what you're missing.

    The time to clone a USB is minutes a month (or every few months).

    The time to restore a config in XS would be hours, at the point in time it crashes. If not longer. Plus you have no recent backup to work from.

    Understood.

    Do they really not fail that much? We've seen a few on ML just this month.

    Coincidence, maybe.

    I wonder if you let the logs write to the USB stick, would it really die quickly, anyway?

    USB storage sticks are very hit and miss with their reliability.

    The only brand I haven't had a problem with are the Micro Center branded USB drives. They're also the only ones I know of that give you a lifetime guarantee. Walk in with a bad USB drive and walk out with a new one.



  • @BRRABill said in How Reliable Is Your Server:

    @DustinB3403 said

    They don't fail that often is what you're missing.

    The time to clone a USB is minutes a month (or every few months).

    The time to restore a config in XS would be hours, at the point in time it crashes. If not longer. Plus you have no recent backup to work from.

    Understood.

    Do they really not fail that much? We've seen a few on ML just this month.

    Coincidence, maybe.

    I wonder if you let the logs write to the USB stick, would it really die quickly, anyway?

    You can always just have one extra USB and keep a copy of the image somewhere. Once you need to use your backup USB then just order another one and write the image to it.
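    That clone-and-restore routine is simple enough to sketch. A minimal version in Python (the paths are hypothetical, `dd` or similar would do the same job, and writing to a raw device requires root):

```python
def clone_image(src_path: str, dst_path: str, chunk_size: int = 4 * 1024 * 1024) -> None:
    """Copy a raw image (or block device) to another path in fixed-size chunks."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            dst.write(chunk)

# Example (hypothetical paths):
# clone_image("/dev/sdb", "/backups/usb-boot.img")   # take the image of the boot stick
# clone_image("/backups/usb-boot.img", "/dev/sdc")   # write it to a fresh stick
```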



  • @stacksofplates said

    You can always just have one extra USB and keep a copy of the image somewhere. Once you need to use your backup USB then just order another one and write the image to it.

    I guess my point, going along with the OP about reliability, is that two small SATA drives would probably run for years without issue. My servers are 10+ years old, and have only recently started having drive failures. That's 24x7x365x10 (or whatever, haha) without needing spares or worrying constantly that it was going to fail.

    Why introduce something so finicky into a server situation if we are concerned about reliability?



  • @BRRABill said in How Reliable Is Your Server:

    @stacksofplates said

    You can always just have one extra USB and keep a copy of the image somewhere. Once you need to use your backup USB then just order another one and write the image to it.

    I guess my point, going along with the OP about reliability, is that two small SATA drives would probably run for years without issue. My servers are 10+ years old, and have only recently started having drive failures. That's 24x7x365x10 (or whatever, haha) without needing spares or worrying constantly that it was going to fail.

    Why introduce something so finicky into a server situation if we are concerned about reliability?

    Oh, I'm not arguing that the drives wouldn't last longer, just that it's cheap to replicate the USB drives. I think stopping log writes to the USB drive would drastically increase its life. You could also just load the whole hypervisor into a RAM disk, ha.
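    For anyone wanting to try that, one common approach (a sketch, not XenServer-specific guidance; the mount point and size are assumptions) is to back /var/log with a RAM-based tmpfs in /etc/fstab so log writes never touch the USB stick:

```
# /etc/fstab - keep log writes off the USB boot stick (logs are lost on reboot)
tmpfs  /var/log  tmpfs  defaults,noatime,size=256m  0  0
```

    The trade-off is that anything in /var/log vanishes on reboot, so you'd want remote syslog if the history matters.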



  • @stacksofplates said

    I think stopping log writing to the USB drive would drastically increase the life of it.

    Haha my server responded to me doing this by crashing and burning.



  • @BRRABill said in How Reliable Is Your Server:

    @stacksofplates said

    I think stopping log writing to the USB drive would drastically increase the life of it.

    Haha my server responded to me doing this by crashing and burning.

    Lol, you could always do software RAID 1 with two USB drives.


  • Service Provider

    @stacksofplates said in How Reliable Is Your Server:

    @BRRABill said in How Reliable Is Your Server:

    @stacksofplates said

    You can always just have one extra USB and keep a copy of the image somewhere. Once you need to use your backup USB then just order another one and write the image to it.

    I guess my point, going along with the OP about reliability, is that two small SATA drives would probably run for years without issue. My servers are 10+ years old, and have only recently started having drive failures. That's 24x7x365x10 (or whatever, haha) without needing spares or worrying constantly that it was going to fail.

    Why introduce something so finicky into a server situation if we are concerned about reliability?

    Oh, I'm not arguing that the drives wouldn't last longer, just that it's cheap to replicate the USB drives. I think stopping log writes to the USB drive would drastically increase its life. You could also just load the whole hypervisor into a RAM disk, ha.

    Outside of systems logging to the USBs dying, I really never run into them having problems.



  • @scottalanmiller said

    Outside of systems logging to the USBs dying, I really never run into them having problems.

    Just for giggles, how much data do you think can be written to a USB drive before it kicks the bucket?

    Like say you left logging on for some crazy reason.

    How long would you feel "safe" using the USB?


  • Service Provider

    @BRRABill said in How Reliable Is Your Server:

    @scottalanmiller said

    Outside of systems logging to the USBs dying, I really never run into them having problems.

    Just for giggles, how much data do you think can be written to a USB drive before it kicks the bucket?

    Like say you left logging on for some crazy reason.

    How long would you feel "safe" using the USB?

    Not very long. Totally not how they are meant to be used. Their utility is in being a write once, read many device. In fact, I'd recommend hitting that little lock option on the side if it is available.



  • @scottalanmiller said

    Not very long. Totally not how they are meant to be used. Their utility is in being a write once, read many device. In fact, I'd recommend hitting that little lock option on the side if it is available.

    Ooooh, the server really wouldn't like that!


  • Service Provider

    This point just came up, again, on SW. SAN promoted not because SANs are reliable, but because servers must not be. This assumption drives so many recommendations, it's crazy.



  • @scottalanmiller said in How Reliable Is Your Server:

    This point just came up, again, on SW. SAN promoted not because SANs are reliable, but because servers must not be. This assumption drives so many recommendations, it's crazy.

    As you had a hard time convincing me, originally... It's all about perception.


  • Service Provider

    What's humorous in this particular example is that the server and SAN would come from the same vendor, which is common. And someone asked the vendor to speak up. But the vendor is trapped: the question has to be framed as a relative safety concern, and for an IPOD to make sense the SAN must be orders of magnitude safer than the servers. That means the storage team would be forced to throw the server team (who makes their SANs for them) under the bus in order to sell the SAN, which would in turn make their SANs look bad as the product of the very team they just said could not make reliable gear.

    I guarantee the vendor is going to stay out of it. Even if the servers were not reliable, they would not be in a position to say anything. But the servers are extremely reliable.



  • @BRRABill said in How Reliable Is Your Server:

    I am more afraid of the software ON the server than the server itself.

    Oh, and recently, USB boot devices. 🙂

    Yeah, main causes of downtime:

    1. Humans
    2. Software
      .
      .
      .
      .
      .
      .
      .
      .
      .
      .
    3. Hardware

    I had a RAID controller fail on me on a fairly new ProLiant recently. That wasn't nice. But HP said it was a firmware issue and upgraded the firmware. So even what I thought was a hardware failure was actually software - not that it makes any difference as the server is still down.

    Basically, it seems that if it doesn't move, it doesn't fail.



  • @Carnival-Boy Software is written by humans, so couldn't we condense that down to?

    1. Humans
      .
      .
      .
      .
      .
      .
      .
      .
      .
      .
      .
      .
    2. Hardware

  • Service Provider

    @travisdh1 said in How Reliable Is Your Server:

    @Carnival-Boy Software is written by humans, so couldn't we condense that down to?

    1. Humans
      .
      .
      .
      .
      .
      .
      .
      .
      .
      .
      .
      .
    2. Hardware

    But hardware is made by humans. So...

    1. Humans.


  • @scottalanmiller said in How Reliable Is Your Server:

    @travisdh1 said in How Reliable Is Your Server:

    @Carnival-Boy Software is written by humans, so couldn't we condense that down to?

    1. Humans
      .
      .
      .
      .
      .
      .
      .
      .
      .
      .
      .
      .
    2. Hardware

    But hardware is made by humans. So...

    1. Humans.

    We have a winner!


  • Service Provider

    @travisdh1 said in How Reliable Is Your Server:

    @scottalanmiller said in How Reliable Is Your Server:

    @travisdh1 said in How Reliable Is Your Server:

    @Carnival-Boy Software is written by humans, so couldn't we condense that down to?

    1. Humans
      .
      .
      .
      .
      .
      .
      .
      .
      .
      .
      .
      .
    2. Hardware

    But hardware is made by humans. So...

    1. Humans.

    We have a winner!

    Even humans are made by other humans. Meta human failure!


  • Service Provider

    Adding in @HPEStorageGuy as we were discussing this exact topic a few minutes ago on SW in another thread.

