Defining High Availability

scottalanmiller

@Jimmy9008 said in Defining High Availability:

I find this all very interesting. Anywhere to read more in depth on industry standards surrounding this?

Industry standards are very general, which is why I tackled it around server numbers specifically. HA is used 99% of the time (I made that up) by marketing, and only 1% by actual IT. IT should avoid such a generality and work with real numbers (X nines) as goals. There is really no time that working with "HA" as a general concept works for IT, because it's a process, not a product, and because achieving the right availability for the cost is a sliding scale that we have to work with for everything.

So defining HA for a specific item (a server, WordPress, an ERP, etc.) has to be done on a case-by-case basis. Physical servers have a known industry standard, so an order of magnitude better (HA) or worse (LA) is easy to define. For anything software related, it is not so clear.

Then there is more to it, as well. If a standard server gets around five nines of availability, and HA is six nines, what if we need "in between" or "far more"? You can't work with a term like HA; you must define the "nines" and work with that.
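To make the "in between" idea concrete, here is a minimal sketch that treats the number of nines as a continuous value, using the common convention nines = -log10(1 - availability). The function names are hypothetical and not from the thread.

```python
import math

def nines(availability):
    """Return the 'number of nines' for an availability fraction in [0, 1)."""
    if not 0 <= availability < 1:
        raise ValueError("availability must be in [0, 1)")
    return -math.log10(1 - availability)

def availability_for(nines_target):
    """Return the availability fraction that corresponds to a nines target."""
    return 1 - 10 ** (-nines_target)

print(round(nines(0.99999), 2))        # five nines
print(round(nines(0.999999), 2))       # six nines
print(f"{availability_for(5.5):.7%}")  # an "in between" target of 5.5 nines
```

With this, a goal such as "5.5 nines" is just another number to plan against, rather than something that has to fit under the HA label.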

scottalanmiller

@Jimmy9008 said in Defining High Availability:

My team base availability on HTTP/S error codes. If a code comes back, say 404, then we consider that unavailable. If the page loads, but the site does not function because our development team messed a release up, as long as it is not an error such as 404, we consider we are available.

A 404 would be tough in that case, because you might be calling something unavailable based on a bad request.

Example: My store must be open 24/7.

Problem (that a 404 represents): Customer went to the wrong address and didn't find my store.
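One way to act on that distinction, as a rough sketch: count only server-side failures (5xx responses or no response at all) against availability, and treat 4xx responses such as 404 as a problem with the request. This split is an assumption made for illustration, not a convention stated in the thread, and the function names are hypothetical.

```python
def counts_as_outage(status):
    """status: an int HTTP status code, or None when no response arrived."""
    if status is None:
        return True  # the store was closed: nothing answered at all
    return 500 <= status <= 599  # server-side failure

def measured_availability(probe_results):
    """Fraction of probes that did not count as an outage."""
    if not probe_results:
        raise ValueError("need at least one probe result")
    down = sum(1 for status in probe_results if counts_as_outage(status))
    return 1 - down / len(probe_results)

# A 404 does not count against the service; a 503 or a timeout does.
print(measured_availability([200, 404, 503, None]))
```

Note that this still cannot detect the case where the page loads but the site does not function, so it only covers part of what "available" means.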

scottalanmiller

@Dashrender said in Defining High Availability:

0.9999939117, or as a percentage, 99.99939117% uptime

AKA: Five Nines

Or more accurately, 5 Nines+

That extra "39" after your five nines is a significant improvement over five nines, but not close to six nines. I'd call it "really good" availability.

Dashrender

@scottalanmiller said in Defining High Availability:

Or more accurately, 5 Nines+

That extra "39" after your five nines is a significant improvement over five nines, but not close to six nines. I'd call it "really good" availability.

And significant means the difference between 315.36 seconds of downtime per year and your 192 seconds (5 min 15.36 s vs 3 min 12 s).
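The 315.36 second figure can be reproduced with a short sketch, assuming a 365-day year (31,536,000 seconds). The helper names are hypothetical.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000, assuming a non-leap year

def downtime_budget_seconds(nines, period_seconds=SECONDS_PER_YEAR):
    """Allowed downtime in the period for a given number of nines."""
    return period_seconds * 10 ** (-nines)

def format_duration(seconds):
    """Render seconds as 'M min S.SS s'."""
    minutes, rest = divmod(seconds, 60)
    return f"{int(minutes)} min {rest:.2f} s"

for n in (4, 5, 6):
    budget = downtime_budget_seconds(n)
    print(f"{n} nines: {budget:.2f} s per year ({format_duration(budget)})")
```

Passing a different `period_seconds` gives the budget for a month or a decade instead of a year.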

scottalanmiller

@Dashrender said in Defining High Availability:

And significant means the difference between 315.36 seconds of downtime per year and your 192 seconds (5 min 15.36 s vs 3 min 12 s).

Yeah. It's nearly half!

Jimmy9008

@scottalanmiller said in Defining High Availability:

Yeah. It's nearly half!

Hopefully it will remain at 48 seconds for the rest of the year. So, if that were to happen, we would be at 99.99984779%, correct?

scottalanmiller

@Jimmy9008 said in Defining High Availability:

Hopefully it will remain at 48 seconds for the rest of the year. So, if that were to happen, we would be at 99.99984779%, correct?

Sounds about right. Six nines is just 31.5 seconds a year, or about 2.6 seconds a month!
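Going the other way, from measured downtime to an availability percentage, is a one-line calculation. The sketch below checks the two figures quoted in the thread, again assuming a 365-day year; the function name is hypothetical.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumes a non-leap year

def availability_percent(downtime_seconds, period_seconds=SECONDS_PER_YEAR):
    """Availability over the period, as a percentage."""
    return 100 * (1 - downtime_seconds / period_seconds)

for downtime in (192, 48):
    print(f"{downtime} s down -> {availability_percent(downtime):.8f}%")
```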

Jimmy9008

@scottalanmiller said in Defining High Availability:

Sounds about right. Six nines is just 31.5 seconds a year, or about 2.6 seconds a month!

Yeah, not long. That is unplanned downtime only though. We have plenty of planned downtime for running updates and other projects. But still, good.

I'm off to a new job on 1st of April so won't know the end of year figure. I'd hope it is around 48 seconds though.

scottalanmiller

@Jimmy9008 said in Defining High Availability:

Yeah, not long. That is unplanned downtime only though. We have plenty of planned downtime for running updates and other projects. But still, good.

Ah, we often describe those as "Planned Availability" and "Unplanned Availability". Most people talking HA want both at six nines.

scottalanmiller

NTG had two servers in our early days that beat six nines over a decade. We just got lucky, but holy cow.

scottalanmiller

@Jimmy9008 keep in mind that resulting availability and risk aren't the same thing. Any five nines system is expected to hit six nines nine out of ten years. It's the average over the operating lifespan, not over a set interval. Otherwise any normal interval that you select would have 100% uptime.

So there are two ways to look at it reasonably...

Resulting Availability Over Operational Lifetime
Expected Availability Over Operational Lifetime

The first is what an individual system actually provides. The second is the average of all systems configured identically, over all of their operational lifetimes.

The first you measure. The second you project with simulations.
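A toy simulation shows how the two views can diverge. Everything in the model is an assumption for illustration: outages arrive as a Poisson process averaging one per ten years, and each lasts 3,153.6 seconds (about 52.6 minutes), so the long-run average is 315.36 seconds per year, i.e. five nines. Real failure behavior is far messier than this.

```python
import random

SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumes a non-leap year

def outages_in_one_year(rng, outages_per_year):
    """Count outages in one year by drawing exponential gaps between them."""
    count = 0
    elapsed = rng.expovariate(outages_per_year)  # years until the first outage
    while elapsed < 1.0:
        count += 1
        elapsed += rng.expovariate(outages_per_year)
    return count

def simulate(system_years=100_000, outages_per_year=0.1,
             outage_seconds=3153.6, seed=1):
    """Return (expected availability, share of years at six nines or better)."""
    rng = random.Random(seed)
    six_nines_budget = SECONDS_PER_YEAR * 1e-6
    total_down = 0.0
    good_years = 0
    for _ in range(system_years):
        down = outages_in_one_year(rng, outages_per_year) * outage_seconds
        total_down += down
        if down <= six_nines_budget:
            good_years += 1
    expected = 1 - total_down / (system_years * SECONDS_PER_YEAR)
    return expected, good_years / system_years

expected, share = simulate()
print(f"expected availability: {expected:.6%}")
print(f"years at six nines or better: {share:.1%}")
```

Under these assumptions the chance of an outage-free year is exp(-0.1), roughly 0.905, so about nine years in ten show 100% resulting availability even though the expected availability is only five nines.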

In extremely large systems, like Backblaze, they get close approximations to the latter through measurement, because they look only at small components (like hard drives) of which they have substantial numbers, enough to create a reasonable approximation of a full number.

When I was on Wall St., we had 80,000 servers in our pool and so we had actual risk and availability numbers for the industry in datacenters like ours. But it still only told us about a handful of server models, and only under our exact conditions. And it still took a decade or more to produce meaningful numbers, and those numbers only applied to the servers of the past, not the ones being installed new.