Defining High Availability



  • @NetworkNerd said in Defining High Availability:

    Does anyone think that business leaders have tendencies to think investing in improving the availability of a service (whatever that entails) is just nothing more than buying insurance?

In reality, insurance's purpose is to mitigate losses from a lack of availability. It's effectively an admission of bad availability (or a lack of faith in it). By definition, insurance only kicks in when availability fails.



  • I find this all very interesting. Anywhere to read more in depth on industry standards surrounding this?

My team bases availability on HTTP/S status codes. If an error code comes back, say a 404, then we consider the site unavailable. If the page loads but the site doesn't function because our development team botched a release, then as long as no error code such as a 404 is returned, we consider ourselves available.

Our development team probably calculates their up-time differently, but it's all very interesting to me.

We are at 48 seconds of unavailability so far in 2019. Provided we hold that rate for the rest of the year (48 seconds per quarter), how many 9's does that project for us? (Not really sure how to calculate that)...

    So, how many 9's up-time would 192 seconds of downtime for a whole year be?



  • @Jimmy9008 said in Defining High Availability:

    So, how many 9's up-time would 192 seconds of downtime for a whole year be?

365 (days) × 24 (hours) × 60 (min) × 60 (sec) = total seconds in 1 year = 31,536,000

31,536,000 − 192 = 31,535,808 seconds of uptime

31,535,808 (actual uptime) / 31,536,000 (max uptime) = 0.9999939117, or 99.99939117% uptime
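The arithmetic above generalizes. A quick sketch (the helper names `availability` and `nines` are illustrative, not any standard tool) for turning annual downtime into a nines count:

```python
# Sketch: convert annual downtime (seconds) into an availability
# fraction and a rough "nines" count. Illustrative names only.
import math

SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000 (non-leap year)

def availability(downtime_seconds: float) -> float:
    """Fraction of the year the service was up."""
    return (SECONDS_PER_YEAR - downtime_seconds) / SECONDS_PER_YEAR

def nines(downtime_seconds: float) -> float:
    """Number of leading nines, e.g. 99.999% -> 5.0."""
    return -math.log10(1 - availability(downtime_seconds))

print(f"{availability(192):.10f}")  # 0.9999939117
print(f"{nines(192):.2f}")          # about 5.22 nines
```

So 192 seconds of annual downtime lands a bit past five nines, which matches the percentage worked out above.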



  • @Jimmy9008 said in Defining High Availability:

    I find this all very interesting. Anywhere to read more in depth on industry standards surrounding this?

My team bases availability on HTTP/S status codes. If an error code comes back, say a 404, then we consider the site unavailable. If the page loads but the site doesn't function because our development team botched a release, then as long as no error code such as a 404 is returned, we consider ourselves available.

Our development team probably calculates their up-time differently, but it's all very interesting to me.

    Uptime can be many different things.

    The platform uptime, the application uptime, the internet connection uptime, etc.

Assuming you're providing a service to someone else, the only thing they care about is the uptime they have connecting to that service. So 404s aren't the only thing that matters. If your app is dead but the page doesn't display a 404, it's still an outage to the end user.

I'm guessing that, for the most part, that's the measure that primarily matters; so the fact that your team looks only at their own bit doesn't make the customer any happier.



  • @Jimmy9008 said in Defining High Availability:

    I find this all very interesting. Anywhere to read more in depth on industry standards surrounding this?

Industry standards are very general, which is why I tackled it around server numbers specifically. HA is used 99% of the time (I made that number up) by marketing, and only 1% by actual IT. IT shouldn't use such a generality and must work with real numbers (X nines) as goals. There is really no situation where working with "HA" as a general concept works for IT, because availability is a process, not a product, and because achieving proper availability at cost is a sliding scale that we have to work with for everything.

So defining HA for a specific item (a server, WordPress, an ERP, etc.) is a case-by-case exercise. Physical servers have a known industry standard, so an order of magnitude better (HA) or worse (LA) is easy to define. For anything software related, it is not so clear.

Then there is more to it as well. If a standard server gets around five nines of availability, and HA is six nines, what if we need something "in between" or "far more"? You can't work with a term like HA; you must define the nines and work with that.
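To make "define the nines" concrete, here is a small sketch (illustrative, assuming a 365-day year) of the annual downtime budget each level allows:

```python
# Annual downtime budget per "nines" level (365-day year).
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000

for n in range(2, 7):
    budget = SECONDS_PER_YEAR * 10 ** -n  # seconds of allowed downtime
    print(f"{n} nines = {1 - 10 ** -n:.6%} -> {budget:,.2f} s/year")
```

Five nines works out to about 315 seconds a year and six nines to about 31.5, so a goal "in between" needs an explicit number rather than the label HA.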



  • @Jimmy9008 said in Defining High Availability:

My team bases availability on HTTP/S status codes. If an error code comes back, say a 404, then we consider the site unavailable. If the page loads but the site doesn't function because our development team botched a release, then as long as no error code such as a 404 is returned, we consider ourselves available.

A 404 would be a tough signal in that case, because you might be calling something unavailable based on a bad request.

    Example: My store must be open 24/7.

    Problem (that a 404 represents): Customer went to the wrong address and didn't find my store.



  • @Dashrender said in Defining High Availability:

    .9999939117 or in percent 99.99939117 % uptime

    AKA: Five Nines

    Or more accurately, 5 Nines+

    That extra "39" after your five nines is a significant improvement over five nines, but not close to six nines. I'd call it "really good" availability 🙂



  • @scottalanmiller said in Defining High Availability:


    That extra "39" after your five nines is a significant improvement over five nines, but not close to six nines. I'd call it "really good" availability 🙂

And significant means the difference between 315.36 seconds of downtime vs your 192 seconds (5 min 15.36 s vs 3 min 12 s).



  • @Dashrender said in Defining High Availability:


    And significant means the difference between 315.36 seconds of downtime vs your 192 seconds (5 mins 15.36 second vs 3 min 12 seconds).

Yeah. It's nearly half!



  • @scottalanmiller said in Defining High Availability:


    And significant means the difference between 315.36 seconds of downtime vs your 192 seconds (5 mins 15.36 second vs 3 min 12 seconds).

Yeah. It's nearly half!

Hopefully it will remain at 48 seconds for the rest of the year. So, if that were to happen, we would be at 99.99984779%, correct?



  • @Jimmy9008 said in Defining High Availability:


Hopefully it will remain at 48 seconds for the rest of the year. So, if that were to happen, we would be at 99.99984779%, correct?

Sounds about right. Six nines would be just 31.5 seconds a year!



  • @scottalanmiller said in Defining High Availability:


Sounds about right. Six nines would be just 31.5 seconds a year!

Yeah, not long. That is unplanned downtime only, though. We have plenty of planned downtime for running updates and other projects. But still, good.

I'm off to a new job on the 1st of April, so I won't know the end-of-year figure. I'd hope it is around 48 seconds, though.



  • @Jimmy9008 said in Defining High Availability:

    Yeah, not long. That is unplanned downtime only though. We have plenty of planned downtime for running updates and other projects. But still, good.

    Ah, we often describe those as "Planned Availability" and "Unplanned Availability". Most people talking HA want both at six nines.



NTG had two servers in our early days that beat six nines over a decade. We just got lucky, but holy cow.



  • @Jimmy9008 keep in mind that resulting availability and risk aren't the same thing. Any five nines system is expected to hit six nines nine out of ten years. It's the average over the operating lifespan, not over a set interval. Otherwise any normal interval that you select would have 100% uptime.

    So there are two ways to look at it reasonably...

    1. Resulting Availability Over Operational Lifetime
    2. Expected Availability Over Operational Lifetime

    The first is what an individual system actually provides. The second is the average of all systems configured identically, over all of their operational lifetimes.

    The first you measure. The second you project with simulations.

In extremely large systems, like Backblaze's, they get close approximations to the latter through measurement, because they look only at small components (like hard drives) of which they have substantial numbers, giving a reasonable approximation to a full number.

When I was on Wall St., we had 80,000 servers in our pool, so we had actual risk and availability numbers for the industry in datacenters like ours. But that still only told us about a handful of server models, and only under our exact conditions. And it still took a decade or more to produce meaningful numbers, and those numbers only applied to the servers of the past, not the ones being installed new.
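The "project with simulations" idea above can be sketched with a simple Monte Carlo model. Everything here is illustrative: the failure probability, repair-time distribution, and function names are assumptions for the sketch, not measurements from any real fleet.

```python
# Sketch: projecting "expected availability over operational lifetime"
# by averaging many simulated identical systems. The annual failure
# probability and mean repair time are made-up illustrative numbers.
import random

HOURS_PER_YEAR = 365 * 24

def simulate_lifetime(years=10, annual_failure_prob=0.05,
                      mean_repair_hours=4.0, rng=random):
    """Return the availability of one simulated system over its lifetime."""
    total_hours = years * HOURS_PER_YEAR
    downtime = 0.0
    for _ in range(years):
        if rng.random() < annual_failure_prob:
            # Repair time drawn from an exponential distribution.
            downtime += rng.expovariate(1.0 / mean_repair_hours)
    return (total_hours - downtime) / total_hours

def expected_availability(trials=100_000, **kwargs):
    """Average over many simulated identically-configured systems."""
    return sum(simulate_lifetime(**kwargs) for _ in range(trials)) / trials

print(f"{expected_availability():.6f}")
```

Any single simulated system (the "resulting availability") can land far from the average; only the pool of trials approaches the "expected availability," which is the distinction drawn above.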