How Does Local Storage Offer High Availability

scottalanmiller

Adding more components into the same chassis does not make it more reliable. It makes it a larger SPOF.

SPOF is a very dangerous term because it drives an emotional response and gets people to look away from the full system reliability. Adding more components into a single chassis may make it a bigger SPOF, it might make it more fragile, or it might make it more reliable. Look at an EMC VMAX or an IBM z/90. They are SPOFs and you've never met someone who's had one die, ever. They run for decades. They use single device reliability instead of multiple device failover to achieve reliability.

That something is a SPOF isn't a problem. The only thing that matters is resulting reliability.

dafyre

@scottalanmiller said:

Just look at the HP MSA SAN devices. They are truly redundant by both definitions. Yet their redundancy loses reliability. So regardless of definition, redundancy on its own is never a goal, it's only a means to an end and never a given.

But see, you are going for device reliability, I'm not arguing that. I agree that here, having redundancy does nothing to help the reliability of the individual devices.

I am speaking here to the appearance of reliability. The redundant system takes over when the main one fails, thus the system, overall, appears more reliable.

Here is the Cambridge Dictionary's definition of redundant. Nothing suggesting what you found on Google:

Dictionary.com -- http://dictionary.reference.com/browse/redundant?s=t (check out part d)

scottalanmiller

@dafyre said:

I get what you are saying about reliability. I think we are talking about two different types of reliability. You are speaking of a single device reliability ( a single server). I am thinking of perceived reliability -- the reliability of the whole system.

I'm talking about both. The reliability of each component of a system is a factor that leads to the resulting system reliability.

I can build a system with high availability with a single server or with a cluster of them. The one requires the individual server to be highly reliable itself, the other requires that multiple servers be able to fail over to one another. Two different approaches, each with their own challenges. One is not inherently better than the other.

The problems arise when people start to assume that servers have a fixed reliability and therefore lead to a relatively fixed system reliability. But this isn't remotely true. An HP MSA has a very high single device failure rate while an Oracle M5000 or IBM z/90 have insanely low failure rates. I'll take a single M5000 over a cluster of cheap crappy servers any day.
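To put rough numbers on the "one very reliable box versus a cluster of cheap ones" comparison, here is a minimal sketch. The availability figures are made up for illustration, and the cluster formula assumes independent node failures and perfect failover, which real clusters rarely deliver.

```python
def parallel_availability(node_availability: float, nodes: int) -> float:
    """Availability of a cluster that stays up while at least one node is up.

    Assumes node failures are independent and failover always works.
    Both are optimistic assumptions, not measured facts.
    """
    return 1.0 - (1.0 - node_availability) ** nodes


# Hypothetical figures, not vendor data.
single_good_server = 0.9999
cheap_cluster = parallel_availability(0.99, 2)

print(f"single reliable server: {single_good_server:.6f}")
print(f"two-node cheap cluster: {cheap_cluster:.6f}")
```

With these invented inputs the two approaches land in the same place, and that is under assumptions that favor the cluster. Change the per-node figure or allow failover to fail sometimes and the ranking can go either way.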

scottalanmiller

@dafyre said:

But see, you are going for device reliability, I'm not arguing that. I agree that here, having redundancy does nothing to help the reliability of the individual devices.

No, I'm not. I'm talking about the system. Redundancy is never the goal. Never. The final goal is always system reliability. That system reliability might be achieved through highly reliable individual devices (the brick) or failover of less reliable redundant devices (marshmallows).

I'm always talking about the resulting system reliability.

dafyre

@scottalanmiller said:

@dafyre said:

In case of using individual drives, RAID 0 is not redundant at all. Because the other drives are necessary for the RAID 0 to function. In the case of a single disk failure, RAID 0 becomes lost data.

In the case of drives it IS redundancy. You only need one drive. Now you have two or more. It is the DATA that is not redundant in RAID 0. The drives are very redundant. That's the difference. RAID refers to the drives explicitly, not the data on the drives.

I guess I can see your point about the drives being redundant, and the data is not... but why would you extricate the two?

scottalanmiller

@dafyre said:

Dictionary.com -- http://dictionary.reference.com/browse/redundant?s=t (check out part d)

Yes, there is an engineering definition of redundant that can mean that. But even using that definition, redundant does not mean what people think that it does. RAID 0 still has working drives even after the data is lost.
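A small sketch of the drives-versus-data distinction, assuming each drive survives a given period independently with the same probability (the 0.97 figure is hypothetical). Adding drives to RAID 0 gives you more drives but makes the data less likely to survive, while mirroring does the opposite.

```python
def raid0_data_survival(drive_survival: float, drives: int) -> float:
    """Striping: the data survives only if every drive survives."""
    return drive_survival ** drives


def raid1_data_survival(drive_survival: float, drives: int) -> float:
    """Mirroring: the data survives if at least one drive survives."""
    return 1.0 - (1.0 - drive_survival) ** drives


p = 0.97  # hypothetical per-drive survival probability for the period

for n in (1, 2, 4):
    print(
        f"{n} drive(s): RAID 0 data survival {raid0_data_survival(p, n):.4f}, "
        f"RAID 1 data survival {raid1_data_survival(p, n):.4f}"
    )
```

Both arrays have "redundant" drives in the plain English sense of having more than one. Only the mirrored layout turns the extra drives into extra data reliability.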

scottalanmiller

@dafyre said:

I guess I can see your point about the drives being redundant, and the data is not... but why would you extricate the two?

Because they are two different things and the term RAID is only referring to the drives. More importantly, why would someone assume the two were combined?

Why does anyone talk about redundancy instead of reliability? Who knows, but once people talk about redundancy as a proxy for reliability, bad things will happen.

scottalanmiller

@dafyre said:

I have used these examples with you before. My SAN cluster appears to improve reliability at the top layer. In reality, one node blew out two drives last week, and since it was RAID 5, one node was down until we got new drives in it.

If we had been running on a single SAN device, then we would have been totally dead. However, since we had two that were fully replicated with automagic failover, nobody noticed a thing, and therefore our reliability appears to have increased because of the redundancy. The individual device reliability did not get better or worse, but it did have a failure.

Well, in that example, the cluster is improving reliability over just having a single SAN node. But the SAN itself is lowering the reliability. The redundancy of the dual SAN nodes must be increasing the system reliability; the SAN itself is lowering it. Then there is an additional question that we cannot answer from here: does the redundant SAN (one positive, one negative) come out ahead of having no SAN and no storage redundancy? Often it is just as good to do neither as to do both (but way cheaper).

But the issues in your example are that you have a lot of pieces all affecting real and perceived reliability. In that case at least some of the redundancy is good, some we don't know.
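One way to reason about "one positive, one negative" is to treat the design as a chain of dependencies, where anything in series multiplies availability down and anything in parallel pulls it back up. The sketch below assumes independent failures and perfect failover, and every number in it is invented for illustration.

```python
from math import prod


def series(*availabilities: float) -> float:
    """Every component is required, so availabilities multiply."""
    return prod(availabilities)


def parallel(availability: float, copies: int) -> float:
    """Any one copy is enough (assumes perfect, independent failover)."""
    return 1.0 - (1.0 - availability) ** copies


# Hypothetical component availabilities.
compute = 0.999
local_storage = 0.999
storage_network = 0.9995
san_node = 0.995

local_only = series(compute, local_storage)
single_san = series(compute, storage_network, san_node)
dual_san = series(compute, storage_network, parallel(san_node, 2))

print(f"server with local storage: {local_only:.6f}")
print(f"server + single SAN node:  {single_san:.6f}")
print(f"server + dual SAN nodes:   {dual_san:.6f}")
```

In this model the single SAN is always worse than the dual SAN, which matches the example. Whether the dual SAN beats plain local storage depends entirely on the inputs, which is the part we don't know.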

dafyre

@scottalanmiller said:

I'm always talking about the resulting system reliability.

Right.

So you have one server that is highly reliable. It has no unplanned down time until one day, the RAID controller in the system shorts out and takes out three hard drives. You spend 2 hours down, waiting on parts, and an additional 4 hours restoring from backups.

Now I have two servers that are plain reliable and replicated with failover, etc, etc. Neither of them has any unplanned down time until one day, the RAID controller burns out on one of the nodes. I spend 2 hours waiting on parts, and 1 hour re-installing my OS, and getting the system set back up for replication. I suffer from zero down time.

Which system is more reliable? Yours, of course. A single system would win at reliability.

Which one looks more reliable (it's all about the appearance). Mine would, because as far as my end-users are concerned the system (as a whole, all moving parts involved) did not go down.
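For what it's worth, the arithmetic behind this scenario can be sketched as below. The hours come from the example above; the one-year observation window is my assumption, since the example does not say how long the systems ran before the failure.

```python
HOURS_PER_YEAR = 365 * 24  # assumed observation window


def observed_availability(downtime_hours: float,
                          window_hours: float = HOURS_PER_YEAR) -> float:
    """Fraction of the window during which users could reach the system."""
    return 1.0 - downtime_hours / window_hours


# Single server: 2 hours waiting on parts plus 4 hours restoring from backup.
single_server = observed_availability(2 + 4)

# Replicated pair: users saw no outage, but the pair ran without a
# failover partner for 2 hours of parts plus 1 hour of rebuild.
replicated_pair = observed_availability(0)
hours_without_redundancy = 2 + 1

print(f"single server:   {single_server:.4%}")
print(f"replicated pair: {replicated_pair:.4%}")
print(f"pair ran unprotected for {hours_without_redundancy} hours")
```

This is one observed incident, not a failure rate. It says what the users experienced that year, not how likely either design is to fail next year.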

Dashrender

@scottalanmiller said:

@BBigford said:

@travisdh1 I do remember that one but I thought I had read SAM say something else. Maybe I'm just crazy. I'm probably crazy. It could very well have been that though. That feature was completely misleading and criminal to even put on a feature sheet.

As long as they only called it redundant. Most IT people don't care about reliability, they just want redundancy. So why not sell it to them?

What good is redundancy if it's not reliable?

dafyre

@scottalanmiller said:

@dafyre said:

I guess I can see your point about the drives being redundant, and the data is not... but why would you extricate the two?

Because they are two different things and the term RAID is only referring to the drives. More importantly, why would someone assume the two were combined?

Why does anyone talk about redundancy instead of reliability? Who knows, but once people talk about redundancy as a proxy for reliability, bad things will happen.

I meant to say why would you extricate drives and the data... what good are the drives without data?

scottalanmiller

Using the dictionary definition of engineering redundancy, how would we define something like tightly coupled controllers?

Can they fail over and keep working? Yes, they can.

Are they likely to do so? No.

Is the additional risk introduced by having two objects to possibly fail and to likely cause their peer to fail offset by the possibility of failover sometimes? No.

So this is how I see it:

English Redundancy: Yes, there are two items.
Engineering Redundancy: Yes, they can failover.
Increased Reliability: No, the resulting system has become more fragile.

Even engineering redundancy can lead to fragility if we don't look at the system holistically.
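A rough way to see how two controllers can be worse than one is a simple coverage model, which is my own simplification and not something measured on any product. Each controller fails in a period with probability `q`. When exactly one fails, the pair survives only with probability `coverage`, which lumps together "failover worked" and "the failure did not take the peer down with it".

```python
def single_survival(q: float) -> float:
    """One controller: the system survives if that controller does."""
    return 1.0 - q


def coupled_pair_survival(q: float, coverage: float) -> float:
    """Two coupled controllers under the simplified coverage model.

    The pair survives if neither fails, or if exactly one fails and the
    failover succeeds (probability `coverage`). Both failing is an outage.
    """
    none_fail = (1.0 - q) ** 2
    exactly_one_fails = 2.0 * q * (1.0 - q)
    return none_fail + exactly_one_fails * coverage


q = 0.05  # hypothetical per-controller failure probability

for coverage in (0.3, 0.5, 0.9):
    print(
        f"coverage {coverage:.0%}: pair {coupled_pair_survival(q, coverage):.4f} "
        f"vs single {single_survival(q):.4f}"
    )
```

In this model the pair only beats a single controller when failover succeeds more than half the time. Below that, the second controller is mostly a second thing that can break.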

dafyre

@Dashrender said:

@scottalanmiller said:

@BBigford said:

@travisdh1 I do remember that one but I thought I had read SAM say something else. Maybe I'm just crazy. I'm probably crazy. It could very well have been that though. That feature was completely misleading and criminal to even put on a feature sheet.

As long as they only called it redundant. Most IT people don't care about reliability, they just want redundancy. So why not sell it to them?

What good is redundancy if it's not reliable?

You won't know how reliable your redundancy is until you have something fail.

scottalanmiller

@dafyre said:

I meant to say why would you extricate drives and the data... what good are the drives without data?

What if their job is just to be a cache and they can keep working just fine with a reduced drive count? Drive and data are not the same thing. While they are assumed to be associated, and certainly often are, they are different things. We can't just merge them, we lose the ability to talk about them individually.

scottalanmiller

@Dashrender said:

What good is redundancy if it's not reliable?

It's no good. Which means it would be crazy to ever seek redundancy instead of reliability.

scottalanmiller

@dafyre said:

@scottalanmiller said:

I'm always talking about the resulting system reliability.

Right.

So you have one server that is highly reliable. It has no unplanned down time until one day, the RAID controller in the system shorts out and takes out three hard drives. You spend 2 hours down, waiting on parts, and an additional 4 hours restoring from backups.

Now I have two servers that are plain reliable and replicated with failover, etc, etc. Neither of them has any unplanned down time until one day, the RAID controller burns out on one of the nodes. I spend 2 hours waiting on parts, and 1 hour re-installing my OS, and getting the system set back up for replication. I suffer from zero down time.

Which system is more reliable? Yours, of course. A single system would win at reliability.

Which one looks more reliable (it's all about the appearance). Mine would, because as far as my end-users are concerned the system (as a whole, all moving parts involved) did not go down.

I see. I guess it doesn't look reliable to people who know. If you say to an average person on the street "I have two servers and he has a mainframe, which is more reliable?" I bet they'd say the mainframe because that's how people think. It's rare that people are used to the idea that "two cheap things" is better than "one good thing."

Like "I'll give you two cheap Bics in exchange for your $100 pen", most people would be like "you're nuts, this will last forever."

dafyre

@scottalanmiller said:

Using the dictionary definition of engineering redundancy, how would we define something like tightly coupled controllers?

Can they fail over and keep working? Yes, they can.

Are they likely to do so? No.

Is the additional risk introduced by having two objects to possibly fail and to likely cause their peer to fail offset by the possibility of failover sometimes? No.

So this is how I see it:

English Redundancy: Yes, there are two items.
Engineering Redundancy: Yes, they can failover.
Increased Reliability: No, the resulting system has become more fragile.

^ Now I understand your thought process in this.

Even engineering redundancy can lead to fragility if we don't look at the system holistically.

If the building of redundancy leads to fragility, then something is wrong, IMHO.

wirestyle22

Semantics?

scottalanmiller

@dafyre said:

If the building of redundancy leads to fragility, then something is wrong, IMHO.

That's where it is debatable. Because the goal of an engineer would be reliability. But the goal of a salesman is sales. If the customer demands redundancy and not reliability, then the cheapest path to redundancy is the right one. But put on the business empathy cap and it gets murky. Giving the customer what they want is never wrong, right?

dafyre

@scottalanmiller said:

@dafyre said:

@scottalanmiller said:

I'm always talking about the resulting system reliability.

Right.

So you have one server that is highly reliable. It has no unplanned down time until one day, the RAID controller in the system shorts out and takes out three hard drives. You spend 2 hours down, waiting on parts, and an additional 4 hours restoring from backups.

Now I have two servers that are plain reliable and replicated with failover, etc, etc. Neither of them has any unplanned down time until one day, the RAID controller burns out on one of the nodes. I spend 2 hours waiting on parts, and 1 hour re-installing my OS, and getting the system set back up for replication. I suffer from zero down time.

Which system is more reliable? Yours, of course. A single system would win at reliability.

Which one looks more reliable (it's all about the appearance). Mine would, because as far as my end-users are concerned the system (as a whole, all moving parts involved) did not go down.

I see. I guess it doesn't look reliable to people who know. If you say to an average person on the street "I have two servers and he has a mainframe, which is more reliable?" I bet they'd say the mainframe because that's how people think. It's rare that people are used to the idea that "two cheap things" is better than "one good thing."

Like "I'll give you two cheap Bics in exchange for your $100 pen", most people would be like "you're nuts, this will last forever."

I'd take a few packs of cheap Bics. You can have my $100 pen. It might burst and start leaking tomorrow. The chances of all 20 or 30 Bic pens leaking and bursting tomorrow are slim.