How Does Local Storage Offer High Availability

scottalanmiller

I get what you are saying about reliability. I think we are talking about two different types of reliability. You are speaking of single-device reliability (a single server). I am thinking of perceived reliability -- the reliability of the whole system.

I'm talking about both. The reliability of each component of a system is what determines the resulting system reliability.

I can build a system with high availability with a single server or with a cluster of them. The one requires the individual server to be highly reliable itself, the other requires that multiple servers be able to fail over to one another. Two different approaches, each with their own challenges. One is not inherently better than the other.

The problems arise when people start to assume that servers all have roughly the same fixed reliability, and that the resulting system reliability is therefore relatively fixed too. But this isn't remotely true. An HP MSA has a very high single device failure rate while an Oracle M5000 or IBM z/90 has an insanely low failure rate. I'll take a single M5000 over a cluster of cheap crappy servers any day.
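To put rough numbers on the "brick versus cluster" comparison, here is a minimal sketch. Every figure in it is a made-up assumption rather than a measurement of any real server, and the function name `failover_cluster_availability` is hypothetical. It assumes node failures are independent and that failover itself only works some fraction of the time.

```python
# Sketch only: all availability figures are illustrative assumptions.

def failover_cluster_availability(node: float, nodes: int, failover_success: float) -> float:
    """Availability of a failover cluster.

    The cluster counts as up when the primary is up, or when the primary
    is down, failover succeeds, and at least one other node is up.
    Node failures are assumed to be independent.
    """
    spare_up = 1.0 - (1.0 - node) ** (nodes - 1)
    return node + (1.0 - node) * failover_success * spare_up


brick = 0.9999  # hypothetical single, very reliable server

# Two cheap nodes with failover that works 90% of the time.
cheap_pair = failover_cluster_availability(0.99, nodes=2, failover_success=0.9)

# Two decent nodes with failover that always works.
good_pair = failover_cluster_availability(0.999, nodes=2, failover_success=1.0)

print(f"single brick: {brick:.6f}")
print(f"cheap pair:   {cheap_pair:.6f}")
print(f"good pair:    {good_pair:.6f}")
```

With these assumed inputs the cheap pair lands below the single brick and the good pair lands above it, which is the point: neither approach wins by default, the component numbers decide.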

scottalanmiller

@dafyre said:

But see, you are going for device reliability, I'm not arguing that. I agree that here, having redundancy does nothing to help the reliability of the individual devices.

No, I'm not. I'm talking about the system. Redundancy is never the goal. Never. The final goal is always system reliability. That system reliability might be achieved through highly reliable individual devices (the brick) or failover between less reliable redundant devices (marshmallows).

I'm always talking about the resulting system reliability.

dafyre

@scottalanmiller said:

@dafyre said:

In the case of individual drives, RAID 0 is not redundant at all, because the other drives are necessary for the RAID 0 to function. In the case of a single disk failure, the data on the RAID 0 is lost.

In the case of drives it IS redundancy. You only need one drive. Now you have two or more. It is the DATA that is not redundant in RAID 0. The drives are very redundant. That's the difference. RAID refers to the drives explicitly, not the data on the drives.

I guess I can see your point about the drives being redundant, and the data is not... but why would you extricate the two?

scottalanmiller

@dafyre said:

Dictionary.com -- http://dictionary.reference.com/browse/redundant?s=t (check out part d)

Yes, there is an engineering definition of redundant that can mean that. But even using that definition, redundant does not mean what people think that it does. RAID 0 still has working drives even after the data is lost.
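A rough sketch of why more drives and safer data are separate questions. It assumes drive failures are independent and uses a made-up per-drive survival figure; the function names are hypothetical.

```python
# Sketch only: 0.97 is an assumed chance that one drive survives the period.

def raid0_data_survival(drive_survival: float, drives: int) -> float:
    """RAID 0 data survives only if every drive in the stripe survives."""
    return drive_survival ** drives


def raid1_data_survival(drive_survival: float, drives: int) -> float:
    """RAID 1 data survives if at least one mirror survives."""
    return 1.0 - (1.0 - drive_survival) ** drives


p = 0.97
for n in (1, 2, 4):
    r0 = raid0_data_survival(p, n)
    r1 = raid1_data_survival(p, n)
    print(f"{n} drive(s): RAID 0 data survival {r0:.4f}, RAID 1 data survival {r1:.4f}")
```

Adding drives to a stripe pushes data survival down, while adding mirrors pushes it up. Both arrays have "extra" drives; only one of them has extra copies of the data.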

scottalanmiller

@dafyre said:

I guess I can see your point about the drives being redundant, and the data is not... but why would you extricate the two?

Because they are two different things and the term RAID is only referring to the drives. More importantly, why would someone assume the two were combined?

Why does anyone talk about redundancy instead of reliability? Who knows, but once people talk about redundancy as a proxy for reliability, bad things will happen.

scottalanmiller

@dafyre said:

I have used these examples with you before. My SAN cluster appears to improve reliability at the top layer. In reality, one node blew out two drives last week, and since it was RAID 5, one node was down until we got new drives in it.

If we had been running on a single SAN device, then we would have been totally dead. However, since we had two that were fully replicated with automagic failover, nobody noticed a thing, and therefore our reliability appears to have increased because of the redundancy. The individual device reliability did not get better or worse, but it did have a failure.

Well, in that example, the cluster is improving reliability over just having a single SAN node. But the SAN itself is lowering the reliability. The redundancy of the dual SAN nodes must be increasing the system reliability; the SAN itself is lowering it. Then there is an additional question that we cannot answer from this example: whether the redundant SAN (one positive, one negative) comes out ahead of having no SAN and no storage redundancy. Often doing neither is just as good as doing both (and way cheaper).

But the issues in your example are that you have a lot of pieces all affecting real and perceived reliability. In that case at least some of the redundancy is good, some we don't know.
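The "one positive, one negative" reasoning can be sketched with simple series and parallel availability math. The numbers are assumptions picked for illustration, the helper names are hypothetical, and the model treats failures as independent and failover between SAN nodes as perfect, which flatters the SAN design.

```python
from math import prod

# Sketch only: availability figures are illustrative assumptions.

def series(*parts: float) -> float:
    """Every part must be up for the chain to be up."""
    return prod(parts)


def parallel(*parts: float) -> float:
    """The group is up if at least one part is up (perfect failover assumed)."""
    return 1.0 - prod(1.0 - p for p in parts)


server = 0.999    # assumed availability of the server, local storage included
san_node = 0.995  # assumed availability of one SAN node

local_only = server
single_san = series(server, san_node)
dual_san = series(server, parallel(san_node, san_node))

print(f"local storage only: {local_only:.6f}")
print(f"server + one SAN:   {single_san:.6f}")
print(f"server + dual SAN:  {dual_san:.6f}")
```

With these inputs the dual SAN beats the single SAN but still does not beat having no SAN at all, because the SAN is an extra dependency in series with the server. Different inputs can change the ordering, which is why the question has to be answered per system.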

dafyre

@scottalanmiller said:

I'm always talking about the resulting system reliability.

Right.

So you have one server that is highly reliable. It has no unplanned down time until one day, the RAID controller in the system shorts out and takes out three hard drives. You spend 2 hours down, waiting on parts, and an additional 4 hours restoring from backups.

Now I have two servers that are plain reliable and replicated with failover, etc, etc. Neither of them has any unplanned down time until one day, the RAID controller burns out on one of the nodes. I spend 2 hours waiting on parts, and 1 hour re-installing my OS, and getting the system set back up for replication. I suffer from zero down time.

Which system is more reliable? Yours, of course. A single system would win at reliability.

Which one looks more reliable (it's all about the appearance)? Mine would, because as far as my end-users are concerned the system (as a whole, all moving parts involved) did not go down.
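The downtime in the two scenarios above works out as follows. The one-year observation window is an assumption added for the arithmetic; the hours are the ones from the example, and the function name is hypothetical.

```python
# Sketch only: a one-year window is assumed; the hours come from the example.

HOURS_PER_YEAR = 365 * 24


def observed_availability(downtime_hours: float, window_hours: float = HOURS_PER_YEAR) -> float:
    """Fraction of the window during which users could reach the system."""
    return 1.0 - downtime_hours / window_hours


# Single server: 2 hours waiting on parts plus 4 hours restoring from backup.
single_server = observed_availability(2 + 4)

# Failover pair: users saw no outage, but one node was out for 2 + 1 hours.
failover_pair = observed_availability(0)
degraded_hours = 2 + 1

print(f"single server observed availability: {single_server:.5f}")
print(f"failover pair observed availability: {failover_pair:.5f}")
print(f"hours the pair ran without a spare:  {degraded_hours}")
```

This only scores one incident per system. It says nothing about how often each design is expected to have an incident, and that is a separate question.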

Dashrender

@scottalanmiller said:

@BBigford said:

@travisdh1 I do remember that one but I thought I had read SAM say something else. Maybe I'm just crazy. I'm probably crazy. It could very well have been that though. That feature was completely misleading and criminal to even put on a feature sheet.

As long as they only called it redundant. Most IT people don't care about reliability, they just want redundancy. So why not sell it to them?

What good is redundancy if it's not reliable?

dafyre

@scottalanmiller said:

@dafyre said:

I guess I can see your point about the drives being redundant, and the data is not... but why would you extricate the two?

Because they are two different things and the term RAID is only referring to the drives. More importantly, why would someone assume the two were combined?

Why does anyone talk about redundancy instead of reliability? Who knows, but once people talk about redundancy as a proxy for reliability, bad things will happen.

I meant to say why would you extricate drives and the data... what good are the drives without data?

scottalanmiller

Using the dictionary definition of engineering redundancy, how would we define something like tightly coupled controllers?

Can they fail over and keep working? Yes, they can.

Are they likely to do so? No.

Is the additional risk introduced by having two objects that can each fail, and that are likely to take their peer down when they do, offset by the possibility of an occasional successful failover? No.

So this is how I see it:

English Redundancy: Yes, there are two items.
Engineering Redundancy: Yes, they can failover.
Increased Reliability: No, the resulting system has become more fragile.

Even engineering redundancy can lead to fragility if we don't look at the system holistically.
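A toy model of the tightly coupled controller case, to show how redundancy can make things more fragile. The failure probability and the "takedown" chance (how often a failing controller drags its peer down with it) are assumptions, and the function names are hypothetical.

```python
# Sketch only: p and takedown are illustrative assumptions.

def single_controller_failure(p: float) -> float:
    """Chance of an outage with one controller."""
    return p


def coupled_pair_failure(p: float, takedown: float) -> float:
    """Chance of an outage with two tightly coupled controllers.

    An outage happens when both controllers fail on their own, or when
    exactly one fails and takes its peer down with it.
    """
    exactly_one_fails = 2.0 * p * (1.0 - p)
    both_fail = p * p
    return exactly_one_fails * takedown + both_fail


p = 0.05
print(f"single controller:        {single_controller_failure(p):.4f}")
print(f"pair, takedown chance 0.7: {coupled_pair_failure(p, 0.7):.4f}")
print(f"pair, takedown chance 0.1: {coupled_pair_failure(p, 0.1):.4f}")
```

In this model the pair is worse than a single controller whenever the takedown chance is above one half, no matter what p is. Two items, able to fail over, and still a less reliable system.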

dafyre

@Dashrender said:

@scottalanmiller said:

@BBigford said:

@travisdh1 I do remember that one but I thought I had read SAM say something else. Maybe I'm just crazy. I'm probably crazy. It could very well have been that though. That feature was completely misleading and criminal to even put on a feature sheet.

As long as they only called it redundant. Most IT people don't care about reliability, they just want redundancy. So why not sell it to them?

What good is redundancy if it's not reliable?

You won't know how reliable your redundancy is until you have something fail.

scottalanmiller

@dafyre said:

I meant to say why would you extricate drives and the data... what good are the drives without data?

What if their job is just to be a cache and they can keep working just fine with a reduced drive count? Drive and data are not the same thing. While they are assumed to be associated, and certainly often are, they are different things. If we just merge them, we lose the ability to talk about them individually.

scottalanmiller

@Dashrender said:

What good is redundancy if it's not reliable?

It's no good. Which means it would be crazy to ever seek redundancy instead of reliability.

scottalanmiller

@dafyre said:

@scottalanmiller said:

I'm always talking about the resulting system reliability.

Right.

So you have one server that is highly reliable. It has no unplanned down time until one day, the RAID controller in the system shorts out and takes out three hard drives. You spend 2 hours down, waiting on parts, and an additional 4 hours restoring from backups.

Now I have two servers that are plain reliable and replicated with failover, etc, etc. Neither of them has any unplanned down time until one day, the RAID controller burns out on one of the nodes. I spend 2 hours waiting on parts, and 1 hour re-installing my OS, and getting the system set back up for replication. I suffer from zero down time.

Which system is more reliable? Yours, of course. A single system would win at reliability.

Which one looks more reliable (it's all about the appearance)? Mine would, because as far as my end-users are concerned the system (as a whole, all moving parts involved) did not go down.

I see. I guess it doesn't look reliable to people who know. If you say to an average person on the street "I have two servers and he has a mainframe, which is more reliable?" I bet they'd say the mainframe because that's how people think. It's rare for people to be used to the idea that "two cheap things" are better than "one good thing."

Like "I'll give you two cheap Bics in exchange for your $100 pen", most people would be like "you're nuts, this will last forever."

dafyre

@scottalanmiller said:

Using the dictionary definition of engineering redundancy, how would we define something like tightly coupled controllers?

Can they fail over and keep working? Yes, they can.

Are they likely to do so? No.

Is the additional risk introduced by having two objects that can each fail, and that are likely to take their peer down when they do, offset by the possibility of an occasional successful failover? No.

So this is how I see it:

English Redundancy: Yes, there are two items.
Engineering Redundancy: Yes, they can failover.
Increased Reliability: No, the resulting system has become more fragile.

^ Now I understand your thought process in this.

Even engineering redundancy can lead to fragility if we don't look at the system holistically.

If the building of redundancy leads to fragility, then something is wrong, IMHO.

wirestyle22

Semantics?

scottalanmiller

@dafyre said:

If the building of redundancy leads to fragility, then something is wrong, IMHO.

That's where it is debatable. Because the goal of an engineer would be reliability. But the goal of a salesman is sales. If the customer demands redundancy and not reliability, then the cheapest path to redundancy is the right one. But put on the business empathy cap and it gets murky. Giving the customer what they want is never wrong, right?

dafyre

@scottalanmiller said:

@dafyre said:

@scottalanmiller said:

I'm always talking about the resulting system reliability.

Right.

So you have one server that is highly reliable. It has no unplanned down time until one day, the RAID controller in the system shorts out and takes out three hard drives. You spend 2 hours down, waiting on parts, and an additional 4 hours restoring from backups.

Now I have two servers that are plain reliable and replicated with failover, etc, etc. Neither of them has any unplanned down time until one day, the RAID controller burns out on one of the nodes. I spend 2 hours waiting on parts, and 1 hour re-installing my OS, and getting the system set back up for replication. I suffer from zero down time.

Which system is more reliable? Yours, of course. A single system would win at reliability.

Which one looks more reliable (it's all about the appearance)? Mine would, because as far as my end-users are concerned the system (as a whole, all moving parts involved) did not go down.

I see. I guess it doesn't look reliable to people who know. If you say to an average person on the street "I have two servers and he has a mainframe, which is more reliable?" I bet they'd say the mainframe because that's how people think. It's rare for people to be used to the idea that "two cheap things" are better than "one good thing."

Like "I'll give you two cheap Bics in exchange for your $100 pen", most people would be like "you're nuts, this will last forever."

I'd take a few packs of cheap Bics. You can have my $100 pen. It might burst and start leaking tomorrow. The chances of all 20 or 30 Bic pens leaking and bursting tomorrow are slim.
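The pen arithmetic, as a quick sketch. The per-pen leak chances are invented, and the second function adds an assumed shared cause (say, a bad batch) that ruins every cheap pen at once, since pure independence is the best case for the pack. The function names are hypothetical.

```python
# Sketch only: all probabilities are illustrative assumptions.

def all_fail(p_fail: float, count: int) -> float:
    """Chance that every pen fails, if failures are independent."""
    return p_fail ** count


def all_fail_with_common_cause(p_fail: float, count: int, p_common: float) -> float:
    """Chance that every pen fails when a shared cause can ruin the whole pack."""
    return p_common + (1.0 - p_common) * all_fail(p_fail, count)


expensive_pen = 0.001  # assumed chance the $100 pen leaks tomorrow
cheap_pen = 0.05       # assumed chance any one Bic leaks tomorrow

print(f"$100 pen fails:                {expensive_pen:.6f}")
print(f"20 Bics all fail, independent: {all_fail(cheap_pen, 20):.3e}")
print(f"20 Bics all fail, bad batch:   {all_fail_with_common_cause(cheap_pen, 20, 0.002):.6f}")
```

With independent failures the pack wins easily. Once a shared cause is in play, the answer depends on how likely that shared cause is compared with the single good pen failing.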

scottalanmiller

@wirestyle22 said:

Semantics?

Semantics are one of the most important things in IT. This isn't a theoretical experiment in language, this is a real problem that plagues SMB IT every day. Go on Spiceworks and the average conversation around storage is someone being hoodwinked by this very bit of semantics. They request the wrong thing, they get what they ask for and they end up paying a lot and getting something negative.

scottalanmiller

@Dashrender said:

What good is redundancy if it's not reliable?

That's what this whole thread is trying to convey: that IT pros should never be asking for redundancy as a goal. The goal is always the resulting reliability. Always, no exceptions.

The issue is not that people are building reliability where it isn't useful; it is that people are demanding redundancy without reason. Given that redundancy is only a means to an end (or a proximate goal rather than a real goal), no one should request it. What they need is reliability. If redundancy provides that reliability, no problem. If magic fairy dust does, that's fine too.