Defining High Availability

scottalanmiller

High Availability is one of those terms that gets thrown about rather carelessly, especially in IT circles, as are concepts like "nines of availability." Often we say these things, and even more often businesses demand them, without a clear idea of what they even mean.

There are two components to the term high availability (or HA, as everyone calls it): one is "high", which we need to define, and the other is "availability", which we also need to understand. Let's start with the latter.

Availability refers to what we often call uptime: the amount of time, generally expressed as a percentage, during which a particular service is available for us to use. This is normally measured against production hours and typically does not include planned maintenance time. We usually express the percentage as a number of nines, such as being available 99%, 99.9% or 99.99% of the time, with each extra nine representing an "order of magnitude" better reliability or availability than the level before it. We can also simplify this to two, three or four nines of availability, respectively.
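To make the "order of magnitude" point concrete, here is a minimal sketch that converts a number of nines into a percentage and an allowed downtime. It assumes a 24x365 year with no production-hours or maintenance exclusions, which is a simplification of the definition above; the function names are made up for illustration.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000, ignoring leap years


def availability_from_nines(nines: int) -> float:
    """Return availability as a fraction, e.g. 3 nines -> 0.999."""
    return 1 - 10 ** -nines


def downtime_seconds_per_year(nines: int) -> float:
    """Return the downtime budget per year for the given number of nines."""
    return SECONDS_PER_YEAR * 10 ** -nines


for n in (2, 3, 4, 5):
    pct = availability_from_nines(n) * 100
    hours = downtime_seconds_per_year(n) / 3600
    print(f"{n} nines = {pct:.3f}% available, about {hours:.2f} hours down per year")
```

Each extra nine divides the downtime budget by ten, which is all "an order of magnitude better" means here.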

Availability can become confusing because it refers to different things at different levels. To a storage engineer, availability might be the amount of time that the block storage from the SAN remains available, regardless of whether the servers attached to it have failed or even exist. To a platform engineer, it might mean that the hypervisor has remained functional, regardless of whether the SAN that it is attached to still works or whether the VMs running on it are working. To a systems administrator, it would mean that the OS is functional, but they might not care if applications are still running. The business does not care if anything under the hood still works, only that the resulting services remain working for the end users. So every department and layer has its own perspective on availability metrics and definitions.

John Nicholson: High Availability is something that you do, not something that you buy.

Because HA is something that you deliver as an organization, there is no means of buying a single product to enable this functionality.

High, Standard & Low are ways to define amounts of availability in more general terms than those measured by a number of nines. We do this by simply determining a baseline and then identifying when a system is at least an order of magnitude more (or less) reliable than the baseline. Unlike a nines metric which gives a stable range of availability, the idea of high and low only gives us a relative reference. In many cases, however, this is the more valuable tool as it adjusts usefully with the scenario.

In IT we most often talk about servers in reference to availability. For Standard Availability (or SA), which is our baseline, we typically work from the reliability of a single enterprise commodity server such as the HPE ProLiant DL380 G9 or the Dell PowerEdge R730. These are very comparable servers that represent the most mainstream servers on the market and fit perfectly into a reliability graph as the top of the bell curve. You can get better (HPE Integrity, for example) or worse (whiteboxing your own server) or change architecture (IBM Power), but these are both the mean and the median of the industry.

We don't need to know exactly how reliable a baseline server is. In fact we can't, as environmental factors play a significant role in determining this. An abused server might have only three nines, while one in a great colocation facility might have six. Service SLAs, RAID choices, hot swap parts, part replacement policies and more have a dramatic impact on availability metrics. But they don't affect relative reliability.

So, using these systems as a baseline, high availability refers to servers or computational systems that result in at least one order of magnitude more availability than one of these servers will do on its own under the same conditions, and low availability is about one order of magnitude of reliability less than this baseline would produce under the same conditions.
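The relative definition can be sketched as a comparison of unavailability (one minus availability) against the baseline under the same conditions. This is only one reading of "an order of magnitude": the factor-of-ten thresholds and the `classify` helper are assumptions for illustration, not a standard.

```python
def classify(availability: float, baseline: float) -> str:
    """Label a system HA, SA or LA relative to a baseline availability.

    Both arguments are fractions (0.999 for three nines) and are assumed
    to be estimated under the same environmental conditions.
    """
    downtime = 1 - availability
    baseline_downtime = 1 - baseline
    if downtime <= baseline_downtime / 10:
        return "HA"  # at least ten times less downtime than the baseline
    if downtime >= baseline_downtime * 10:
        return "LA"  # at least ten times more downtime than the baseline
    return "SA"


# Hypothetical figures against a three-nines baseline server.
print(classify(0.99999, baseline=0.999))  # far less downtime than baseline
print(classify(0.9995, baseline=0.999))   # within an order of magnitude
print(classify(0.98, baseline=0.999))     # far more downtime than baseline
```

Note that the same 99.99% system would be HA against a 99.9% baseline but only SA against a 99.99% one, which is the point of a relative term.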

HA (or LA) is simply a differential versus the baseline. It in no way implies that a product was purchased, that an HA branded component is used, that redundancy is employed or that any specific implementation is leveraged. HA is based on the results (of calculated risk, since single systems cannot be observed for outcome), not on the means. While in some arenas it may be common to achieve HA through the use of failover or scale-out technology, in others it is achieved through reliability improvements to a single system (the mainframe approach), and in still others through environmental improvements.

This approach gives us a consistent, logical and, most importantly, useful set of standard terminology that we are able to use time and again to express our needs and values in our system design and architecture.

dafyre

Would it be correct for me to assume that most SMBs focus on the overall Availability of an entire system (whether that be a single server, or multiple servers)?

scottalanmiller

@dafyre said in Defining High Availability:

Would it be correct for me to assume that most SMBs focus on the overall Availability of an entire system (whether that be a single server, or multiple servers)?

Not that I have seen, very much the opposite. SMBs seem to focus on the availability of a single layer, often one chosen at random, without even considering the availability of the entire system.

For example, they tend to focus on layers like the platform, where products with an "HA checkbox" exist, but ignore the layers on which it depends (storage, power, WAN, cooling, etc.) and the layers that depend on it (OS, application, LAN). A large investment in "HA products" is then lost because the components above and below the layer with HA features don't carry HA through to the end users.
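A rough way to see why a single HA layer does not help: if every layer is required and failures are assumed to be independent (a simplification), the availability of the whole stack is the product of the layers. The layer names and figures below are hypothetical.

```python
from math import prod
from typing import Mapping


def chain_availability(layers: Mapping[str, float]) -> float:
    """Availability of a stack in which every layer must be up.

    Assumes layer failures are independent, which real systems only
    approximate.
    """
    return prod(layers.values())


# Hypothetical stack: only the storage layer has the "HA checkbox".
stack = {
    "power": 0.999,
    "storage": 0.99999,
    "platform": 0.999,
    "network": 0.999,
    "application": 0.999,
}

total = chain_availability(stack)
print(f"end-to-end availability: {total:.5f}")
print(f"weakest single layer:    {min(stack.values()):.5f}")
```

The end-to-end figure can never be better than the weakest layer, so the five-nines storage in this sketch is invisible to the end user.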

dafyre

@scottalanmiller said in Defining High Availability:

Not that I have seen, very much the opposite. SMBs seem to focus on the availability of a single layer, often one chosen at random, without even considering the availability of the entire system.

For example, they tend to focus on layers like the platform, where products with an "HA checkbox" exist, but ignore the layers on which it depends (storage, power, WAN, cooling, etc.) and the layers that depend on it (OS, application, LAN). A large investment in "HA products" is then lost because the components above and below the layer with HA features don't carry HA through to the end users.

I.e., buying a two-node, fully replicated SAN, but not buying redundant UPSes and network switches?

scottalanmiller

@dafyre said in Defining High Availability:

I.e., buying a two-node, fully replicated SAN, but not buying redundant UPSes and network switches?

Exactly.

Dashrender

@dafyre said in Defining High Availability:

Would it be correct for me to assume that most SMBs focus on the overall Availability of an entire system (whether that be a single server, or multiple servers)?

See, I was going to say - Yes - they focus on the overall availability, but do so not understanding there is more than one layer and only focus on the layer with the checkboxes - lack of critical thinking.

dafyre

@Dashrender said in Defining High Availability:

See, I was going to say - Yes - they focus on the overall availability, but do so not understanding there is more than one layer and only focus on the layer with the checkboxes - lack of critical thinking.

Exactly. Critical thinking is what led us to purchasing an Active/Active 2-node SAN cluster alongside our VMware infrastructure at my last job...

Q: "What happens if we lost power to this building for 5 minutes?"
A: We're fine, we have UPSes.

Q: "What happens if we lost power to this building for 5 days?"
A: We're screwed unless we buy a backup generator.
BOSS: We have an underutilized generator in another building.
IT: Can we get a room in that building?

scottalanmiller

@Dashrender said in Defining High Availability:

See, I was going to say - Yes - they focus on the overall availability, but do so not understanding there is more than one layer and only focus on the layer with the checkboxes - lack of critical thinking.

That's the same as not focusing on it at all.

Dashrender

@scottalanmiller said in Defining High Availability:

That's the same as not focusing on it at all.

But at least we now more clearly understand where the breakdown is.

scottalanmiller

@Dashrender said in Defining High Availability:

But at least we now more clearly understand where the breakdown is.

The breakdown, as I see it, is one department hoping to get away with blaming a vendor instead of doing their job, and management accepting that something was bought rather than their staff doing what they were brought in to do.

NetworkNerd

Does anyone think that business leaders have tendencies to think investing in improving the availability of a service (whatever that entails) is just nothing more than buying insurance?

dafyre

@NetworkNerd said in Defining High Availability:

Does anyone think that business leaders have tendencies to think investing in improving the availability of a service (whatever that entails) is just nothing more than buying insurance?

Yes. I saw this a bit in my last job. It took a major disaster for management to go oh... IT was right.

scottalanmiller

@NetworkNerd said in Defining High Availability:

Does anyone think that business leaders have tendencies to think investing in improving the availability of a service (whatever that entails) is just nothing more than buying insurance?

That's basically true. But HA and insurance are both the same calculations, more or less. You have the same factors:

How much does the insurance cost?
What are we insured against?
How likely is the insurance company to actually pay out in the event of a disaster?
How much will we lose in the event of a disaster?

You don't do HA for a company that loses $500/hr just like you don't buy insurance for a car that only costs $1,000 to replace in cash. It's just silly. It costs too much and protects against nothing.
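As a back-of-the-envelope sketch of that calculation: compare what HA costs per year with the loss it is expected to avoid. Only the $500/hr figure comes from the example above; the outage hours and the HA cost are hypothetical inputs, and the function name is made up for illustration.

```python
def ha_net_benefit(
    loss_per_hour: float,
    baseline_outage_hours: float,
    ha_outage_hours: float,
    ha_annual_cost: float,
) -> float:
    """Expected yearly gain (positive) or loss (negative) from adding HA."""
    avoided_loss = (baseline_outage_hours - ha_outage_hours) * loss_per_hour
    return avoided_loss - ha_annual_cost


# Hypothetical: 8 hours of outage per year without HA, 0.8 hours with it,
# and an HA solution that costs $20,000 per year.
result = ha_net_benefit(
    loss_per_hour=500,
    baseline_outage_hours=8,
    ha_outage_hours=0.8,
    ha_annual_cost=20_000,
)
print(f"net benefit per year: ${result:,.0f}")
```

With these inputs the result is negative: the HA spend is several times larger than the loss it avoids, which is the "$1,000 car" case.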

scottalanmiller

Adding an additional reference: https://searchdatacenter.techtarget.com/definition/high-availability

In information technology, high availability refers to a system or component that is continuously operational for a desirably long length of time. Availability can be measured relative to "100% operational" or "never failing." A widely-held but difficult-to-achieve standard of availability for a system or product is known as "five 9s" (99.999 percent) availability.

Since a computer system or a network consists of many parts in which all parts usually need to be present in order for the whole to be operational, much planning for high availability centers around backup and failover processing and data storage and access....

scottalanmiller

Wikipedia on High Availability: "High availability (HA) is a characteristic of a system, which aims to ensure an agreed level of operational performance, usually uptime, for a higher than normal period."

scottalanmiller

From Digital Ocean:

What Is High Availability?
In computing, the term availability is used to describe the period of time when a service is available, as well as the time required by a system to respond to a request made by a user. High availability is a quality of a system or component that assures a high level of operational performance for a given period of time.

Measuring Availability
Availability is often expressed as a percentage indicating how much uptime is expected from a particular system or component in a given period of time, where a value of 100% would indicate that the system never fails.

scottalanmiller

@NetworkNerd said in Defining High Availability:

Does anyone think that business leaders have tendencies to think investing in improving the availability of a service (whatever that entails) is just nothing more than buying insurance?

In reality, insurance's purpose is to mitigate losses from a lack of availability. It's like an admission of bad availability (or a lack of faith in it). By definition, insurance only kicks in when availability fails.

Jimmy9008

I find this all very interesting. Anywhere to read more in depth on industry standards surrounding this?

My team base availability on HTTP/S error codes. If a code comes back, say 404, then we consider that unavailable. If the page loads, but the site does not function because our development team messed a release up, as long as it is not an error such as 404, we consider we are available.

Our development team probably calculate their up-time differently, but it's all very interesting to me.

We are at 48 seconds of unavailability so far in 2019. Providing that stays the same for the rest of the year per quarter (48 seconds per quarter), how many 9's does that project for us? (Not really sure how to calculate that)...

So, how many 9's up-time would 192 seconds of downtime for a whole year be?

Dashrender

@Jimmy9008 said in Defining High Availability:

So, how many 9's up-time would 192 seconds of downtime for a whole year be?

365 (days) * 24 (hours) * 60 (min) * 60 (sec) = 31,536,000 total seconds in one year

31,536,000 - 192 = 31,535,808 seconds of uptime

31,535,808 (actual uptime) / 31,536,000 (max uptime) = 0.9999939117, or 99.99939117% uptime. That is above 99.999% but below 99.9999%, so five nines.
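The same arithmetic as a small sketch, in case anyone wants to plug in their own numbers. It assumes a 365-day year measured around the clock; `count_nines` is a made-up helper name.

```python
import math

SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000


def uptime_fraction(downtime_seconds: float, period_seconds: float = SECONDS_PER_YEAR) -> float:
    """Fraction of the period during which the service was up."""
    return (period_seconds - downtime_seconds) / period_seconds


def count_nines(downtime_seconds: float, period_seconds: float = SECONDS_PER_YEAR) -> float:
    """Number of whole nines achieved; infinite if there was no downtime."""
    if downtime_seconds <= 0:
        return math.inf
    unavailability = downtime_seconds / period_seconds
    return math.floor(-math.log10(unavailability))


print(f"{uptime_fraction(192) * 100:.8f}% uptime")
print(f"{count_nines(192)} nines")
```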

Dashrender

@Jimmy9008 said in Defining High Availability:

I find this all very interesting. Anywhere to read more in depth on industry standards surrounding this?

My team base availability on HTTP/S error codes. If a code comes back, say 404, then we consider that unavailable. If the page loads, but the site does not function because our development team messed a release up, as long as it is not an error such as 404, we consider we are available.

Our development team probably calculate their up-time differently, but it's all very interesting to me.

Uptime can be many different things.

The platform uptime, the application uptime, the internet connection uptime, etc.

Assuming you're providing a service to someone else, the only thing they care about is the uptime they have connecting to that service. So 404s aren't the only thing they care about. If your app is dead, yet the page doesn't display a 404, it's still an outage to the end user.

I'm guessing for the most part, that's the one that primarily matters - so the fact that your team looks at only their bit - yeah, doesn't make the customer any happier.
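To illustrate the gap between the two views, here is a minimal sketch that scores the same set of probes two ways: by status code alone, and by whether the service actually worked for the user. The `Probe` record, the "status below 400 means no error" rule and the sample data are all assumptions for illustration.

```python
from typing import List, NamedTuple


class Probe(NamedTuple):
    status_code: int      # HTTP status returned to the monitor
    works_for_user: bool  # did a real user action succeed?


def status_only_availability(probes: List[Probe]) -> float:
    """Share of probes with no HTTP error code (assumes a non-empty list)."""
    ok = sum(1 for p in probes if p.status_code < 400)
    return ok / len(probes)


def end_user_availability(probes: List[Probe]) -> float:
    """Share of probes where the service also worked (assumes a non-empty list)."""
    ok = sum(1 for p in probes if p.status_code < 400 and p.works_for_user)
    return ok / len(probes)


# Hypothetical samples: one 404, and one page that loads but is broken.
samples = [
    Probe(200, True),
    Probe(200, False),
    Probe(404, False),
    Probe(200, True),
]

print(f"status-code view: {status_only_availability(samples):.2f}")
print(f"end-user view:    {end_user_availability(samples):.2f}")
```

The broken release counts as uptime in the first number and as an outage in the second, and the second is the one the customer experiences.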