Defining High Availability

scottalanmiller

@dafyre said in Defining High Availability:

@scottalanmiller said in Defining High Availability:

@dafyre said in Defining High Availability:

Would it be correct for me to assume that most SMBs focus on the overall Availability of an entire system (whether that be a single server, or multiple servers) ?

Not that I have seen, very much the opposite. SMBs seem to focus on the availability of a single layer, often one chosen at random, without even considering the availability of the entire system.

For example, they tend to focus on a layer like the platform, where products with an "HA checkbox" exist, but ignore the layers on which it depends (storage, power, WAN, cooling, etc.) and the layers that depend on it (OS, application, LAN). A large investment in "HA products" is then lost, because the components above and below the layer with HA features don't carry HA through to the end users.

I.e., buying a two-node, fully replicated SAN, but not buying redundant UPSes and network switches?

Exactly.
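
To put a rough number on why one HA layer doesn't rescue the stack: if the layers sit in series and fail independently (both are simplifying assumptions), the end-to-end availability is the product of the per-layer availabilities. A minimal sketch, with layer names and figures invented purely for illustration:

```python
from math import prod

# Hypothetical per-layer availabilities (illustrative, not measured).
layers = {
    "power": 0.999,
    "storage (HA SAN)": 0.99999,
    "network switch": 0.999,
    "platform": 0.9999,
}

def chain_availability(availabilities):
    """Availability of layers in series, assuming independent failures."""
    return prod(availabilities)

total = chain_availability(layers.values())
print(f"end-to-end availability: {total:.5%}")
```

The result is always lower than the weakest layer, so the five-nines SAN in this example ends up limited by the three-nines power and switching around it.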

Dashrender

@dafyre said in Defining High Availability:

Would it be correct for me to assume that most SMBs focus on the overall Availability of an entire system (whether that be a single server, or multiple servers) ?

See, I was going to say - Yes - they focus on the overall availability, but do so not understanding there is more than one layer and only focus on the layer with the checkboxes - lack of critical thinking.

dafyre

@Dashrender said in Defining High Availability:

@dafyre said in Defining High Availability:

Would it be correct for me to assume that most SMBs focus on the overall Availability of an entire system (whether that be a single server, or multiple servers) ?

See, I was going to say - Yes - they focus on the overall availability, but do so not understanding there is more than one layer and only focus on the layer with the checkboxes - lack of critical thinking.

Exactly. Critical thinking is what led us to purchasing an Active/Active 2-node SAN cluster alongside our VMware infrastructure at my last job...

Q: "What happens if we lost power to this building for 5 minutes?"
A: We're fine, we have UPSes.

Q: "What happens if we lost power to this building for 5 days?"
A: We're screwed unless we buy a backup generator.
BOSS: We have an underutilized generator in another building.
IT: Can we get a room in that building?

scottalanmiller

@Dashrender said in Defining High Availability:

@dafyre said in Defining High Availability:

Would it be correct for me to assume that most SMBs focus on the overall Availability of an entire system (whether that be a single server, or multiple servers) ?

See, I was going to say - Yes - they focus on the overall availability, but do so not understanding there is more than one layer and only focus on the layer with the checkboxes - lack of critical thinking.

That's the same as not focusing on it at all.

Dashrender

@scottalanmiller said in Defining High Availability:

@Dashrender said in Defining High Availability:

@dafyre said in Defining High Availability:

Would it be correct for me to assume that most SMBs focus on the overall Availability of an entire system (whether that be a single server, or multiple servers) ?

See, I was going to say - Yes - they focus on the overall availability, but do so not understanding there is more than one layer and only focus on the layer with the checkboxes - lack of critical thinking.

That's the same as not focusing on it at all.

But at least we now more clearly understand where the breakdown is.

scottalanmiller

@Dashrender said in Defining High Availability:

@scottalanmiller said in Defining High Availability:

@Dashrender said in Defining High Availability:

@dafyre said in Defining High Availability:

Would it be correct for me to assume that most SMBs focus on the overall Availability of an entire system (whether that be a single server, or multiple servers) ?

See, I was going to say - Yes - they focus on the overall availability, but do so not understanding there is more than one layer and only focus on the layer with the checkboxes - lack of critical thinking.

That's the same as not focusing on it at all.

But at least we now more clearly understand where the breakdown is.

The breakdown, as I see it, is one department hoping to get away with blaming a vendor instead of doing their job, and management accepting that something was bought rather than their staff doing what they were brought in to do.

NetworkNerd

Does anyone think that business leaders tend to see investing in the availability of a service (whatever that entails) as nothing more than buying insurance?

dafyre

@NetworkNerd said in Defining High Availability:

Does anyone think that business leaders tend to see investing in the availability of a service (whatever that entails) as nothing more than buying insurance?

Yes. I saw this a bit in my last job. It took a major disaster for management to go, "Oh... IT was right."

scottalanmiller

@NetworkNerd said in Defining High Availability:

Does anyone think that business leaders tend to see investing in the availability of a service (whatever that entails) as nothing more than buying insurance?

That's basically true. But HA and insurance come down to the same calculation, more or less. You have the same factors:

How much does the insurance cost?
What are we insured against?
How likely is the insurance company to actually pay out in the event of a disaster?
How much will we lose in the event of a disaster?

You don't do HA for a company that loses $500/hr just like you don't buy insurance for a car that only costs $1,000 to replace in cash. It's just silly. It costs too much and protects against nothing.
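
As a back-of-the-envelope version of that calculation, here is a minimal sketch. Only the $500/hr figure comes from the post; the outage hours, the fraction of outage HA would prevent, and the HA cost are made-up assumptions, and the function name is hypothetical.

```python
def ha_pays_off(loss_per_hour, outage_hours_per_year, fraction_avoided, ha_cost_per_year):
    """Compare the loss HA is expected to prevent with what HA costs per year."""
    expected_loss = loss_per_hour * outage_hours_per_year
    loss_avoided = expected_loss * fraction_avoided
    return loss_avoided > ha_cost_per_year

# $500/hr is from the post; every other number is an illustrative assumption.
print(ha_pays_off(500, 8, 0.9, 20_000))
```

With these inputs HA would avoid about $3,600 of loss a year while costing $20,000, which is the "costs too much and protects against nothing" case.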

scottalanmiller

Adding an additional reference: https://searchdatacenter.techtarget.com/definition/high-availability

In information technology, high availability refers to a system or component that is continuously operational for a desirably long length of time. Availability can be measured relative to "100% operational" or "never failing." A widely-held but difficult-to-achieve standard of availability for a system or product is known as "five 9s" (99.999 percent) availability.

Since a computer system or a network consists of many parts in which all parts usually need to be present in order for the whole to be operational, much planning for high availability centers around backup and failover processing and data storage and access....

scottalanmiller

Wikipedia on High Availability: "High availability (HA) is a characteristic of a system, which aims to ensure an agreed level of operational performance, usually uptime, for a higher than normal period."

scottalanmiller

From Digital Ocean:

What Is High Availability?
In computing, the term availability is used to describe the period of time when a service is available, as well as the time required by a system to respond to a request made by a user. High availability is a quality of a system or component that assures a high level of operational performance for a given period of time.

Measuring Availability
Availability is often expressed as a percentage indicating how much uptime is expected from a particular system or component in a given period of time, where a value of 100% would indicate that the system never fails.

scottalanmiller

@NetworkNerd said in Defining High Availability:

Does anyone think that business leaders tend to see investing in the availability of a service (whatever that entails) as nothing more than buying insurance?

In reality, insurance's purpose is to mitigate losses from a lack of availability. It's like an admission of bad availability (or a lack of faith in it). By definition, insurance only kicks in when availability fails.

Jimmy9008

I find this all very interesting. Anywhere to read more in depth on industry standards surrounding this?

My team base availability on HTTP/S error codes. If a code comes back, say 404, then we consider that unavailable. If the page loads, but the site does not function because our development team messed a release up, as long as it is not an error such as 404, we consider we are available.

Our development team probably calculate their up-time differently, but it's all very interesting to me.

We are at 48 seconds of unavailability so far in 2019. Providing that stays the same for the rest of the year per quarter (48 seconds per quarter), how many 9's does that project for us? (Not really sure how to calculate that)...

So, how many 9's up-time would 192 seconds of downtime for a whole year be?

Dashrender

@Jimmy9008 said in Defining High Availability:

So, how many 9's up-time would 192 seconds of downtime for a whole year be?

365 (days) * 24 (hours) * 60 (min) * 60 (sec) = total seconds in 1 year = 31,536,000

31,536,000 - 192 = seconds of uptime = 31,535,808

31,535,808 (actual uptime) / 31,536,000 (max uptime) = 0.9999939117, or 99.99939117% uptime
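
The same arithmetic as a small script, in case anyone wants to plug in their own downtime. It assumes a 365-day year as above; the "number of nines" is taken as -log10(1 - availability), which is one common way to express it as a continuous value rather than an official standard.

```python
import math

SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000 (non-leap year)

def availability(downtime_seconds, period_seconds=SECONDS_PER_YEAR):
    """Fraction of the period during which the service was up."""
    return (period_seconds - downtime_seconds) / period_seconds

def nines(avail):
    """Continuous 'number of nines'; undefined for avail == 1.0 (zero downtime)."""
    return -math.log10(1 - avail)

a = availability(192)
print(f"{a:.8%} uptime")
print(f"{nines(a):.2f} nines")
```

For 192 seconds this lands a little above five nines, well short of six.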

Dashrender

@Jimmy9008 said in Defining High Availability:

I find this all very interesting. Anywhere to read more in depth on industry standards surrounding this?

My team base availability on HTTP/S error codes. If a code comes back, say 404, then we consider that unavailable. If the page loads, but the site does not function because our development team messed a release up, as long as it is not an error such as 404, we consider we are available.

Our development team probably calculate their up-time differently, but it's all very interesting to me.

Uptime can be many different things.

The platform uptime, the application uptime, the internet connection uptime, etc.

Assuming you're providing a service to someone else - the only thing they care about is the uptime they have connecting to that service. So 404s aren't the only thing they care about. If your app is dead, yet the page doesn't display a 404, it's still an outage to the end user.

I'm guessing for the most part, that's the one that primarily matters - so the fact that your team looks at only their bit - yeah, doesn't make the customer any happier.
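
One way to get closer to what the end user sees is to require more than a good status code, for example some piece of content that only appears when the app really works. A minimal sketch; the function name and the marker string are hypothetical, and a real check would need something specific to your application.

```python
def is_available(status_code, body, required_marker):
    """Count a response as 'up' only if it succeeded AND shows the app working."""
    if not 200 <= status_code < 300:
        return False
    # The marker is an assumed piece of content that a healthy page always contains.
    return required_marker in body

# A page that loads (200) but is missing the working content still counts as down.
print(is_available(200, "<html>Something went wrong</html>", "Add to cart"))
```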

scottalanmiller

@Jimmy9008 said in Defining High Availability:

I find this all very interesting. Anywhere to read more in depth on industry standards surrounding this?

Industry standards are very general, hence why I tackled it around server numbers specifically. HA is used 99% (I made that up) by marketing, and only 1% by actual IT. IT needs to not use such a generality and must work with real numbers (X Nines) as goals. There is really no time that working with "HA" as a general concept works for IT, because it's a process not a product, and because achieving proper availability at cost is a sliding scale that we have to work with for everything.

So defining HA for a specific item (a server, WordPress, an ERP, etc.) is done on a case-by-case basis. Physical servers have a known industry standard, so an order of magnitude better (HA) or worse (LA) is easy to define. For anything software related, it is not so clear.

Then there is more to it, as well. If a standard server gets around five nines of availability and HA is six nines, what if we need "in between" or "far more"? You can't work with a term like HA; you must define the "nines" and work with that.

scottalanmiller

@Jimmy9008 said in Defining High Availability:

My team base availability on HTTP/S error codes. If a code comes back, say 404, then we consider that unavailable. If the page loads, but the site does not function because our development team messed a release up, as long as it is not an error such as 404, we consider we are available.

A 404 would be tough in that case. Because you might be calling something unavailable based on a bad request.

Example: My store must be open 24/7.

Problem (that a 404 represents): Customer went to the wrong address and didn't find my store.
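
If status codes are all you have to go on, one rough policy is to count only the 5xx class (server errors) as outages and treat the 4xx class (client errors) as bad requests. This is a sketch of that assumed policy, not a standard, and it would miss a 404 on a page that really should exist.

```python
def counts_as_outage(status_code):
    """5xx means the server failed; 4xx usually means the request itself was bad."""
    return 500 <= status_code <= 599

for code in (200, 404, 503):
    print(code, counts_as_outage(code))
```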

scottalanmiller

@Dashrender said in Defining High Availability:

.9999939117 or in percent 99.99939117 % uptime

AKA: Five Nines

Or more accurately, 5 Nines+

That extra "39" after your five nines is a significant improvement over five nines, but not close to six nines. I'd call it "really good" availability

Dashrender

@scottalanmiller said in Defining High Availability:

@Dashrender said in Defining High Availability:

.9999939117 or in percent 99.99939117 % uptime

AKA: Five Nines

Or more accurately, 5 Nines+

That extra "39" after your five nines is a significant improvement over five nines, but not close to six nines. I'd call it "really good" availability

And significant means the difference between 315.36 seconds of downtime vs. your 192 seconds (5 min 15.36 sec vs. 3 min 12 sec).
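
For comparing other targets, the yearly downtime allowance for N nines is just the period times 10^-N. A quick sketch, again assuming a 365-day year:

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000 (non-leap year)

def downtime_budget_seconds(nines, period_seconds=SECONDS_PER_YEAR):
    """Allowed downtime per period for an availability target of N nines."""
    return period_seconds * 10 ** (-nines)

for n in range(3, 7):
    print(f"{n} nines: {downtime_budget_seconds(n):,.2f} seconds per year")
```

Five nines works out to the 315.36 seconds above, and six nines to about 31.5 seconds, so 192 seconds sits between the two.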