Burned by Eschewing Best Practices

scottalanmiller

Same as in any redundant system. Two paired disks could fail in a mirrored disk array, two paired power supplies could fail on a host. Redundancy is only about reducing risk, not eliminating it.

Correct, redundancy is just a tool used in the hope of achieving reliability. Redundancy can reduce risk, but it can also increase it.

A good example of where redundancy routinely reduces risk a lot is RAID 1 mirroring. It takes "almost certain to have data loss" of a single drive to "almost never have data loss" of a mirrored pair.
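To put rough numbers on the mirroring point, here is a minimal sketch. It assumes the two drives fail independently and that a failed drive is never replaced, which real arrays do not guarantee (correlated failures and rebuild windows both matter). The per-drive probabilities are made up for illustration and `mirror_loss_probability` is a hypothetical helper, not anything from a RAID tool.

```python
def mirror_loss_probability(drive_loss: float, copies: int = 2) -> float:
    """Data is lost only if every copy fails, assuming independent failures."""
    if not 0.0 <= drive_loss <= 1.0:
        raise ValueError("drive_loss must be a probability between 0 and 1")
    return drive_loss ** copies

# Hypothetical per-drive loss probabilities over some fixed period.
for p in (0.05, 0.10, 0.30):
    pair = mirror_loss_probability(p)
    print(f"single drive: {p:.2%}  mirrored pair: {pair:.2%}")
```

Under those assumptions the pair's chance of loss is the square of the single drive's, which is why the drop is so dramatic when the per-drive risk is small.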

A good example of where redundancy itself routinely increases risk is dual SAN controllers in non-active/active arrays (most SANs that SMBs can afford), where a failing controller can kill its partner, so the pair almost never provides any protection during a real-world failure.

scottalanmiller

@Carnival-Boy said:

Two SANs offers a high degree of redundancy. I'm not sure where 3-2-1 fits in with that? He doesn't have a SPOF does he. He has redundant switches, redundant controllers, redundant SANs.

This is not an IPOD (aka 3-2-1.) I believe that his intended architecture is a 3-2-2. This is not a good design, but not nearly as bad as an IPOD.

The issue here is that there is redundancy, yes, but there is redundancy only through adding points of failure that are not necessary. So while there isn't a SPOF, there is unnecessary complexity as well as extra failure domains - three instead of one. So while this design, if implemented well, can be very reliable, it can never be as reliable as not having the storage layer separate nor can it compete on cost. So it isn't risky, it is unnecessarily risky while wasting money, time and effort.

scottalanmiller

When would a 3-2-2 design actually make sense, since I said it wasn't horrible? When it is actually more like a 20-2-2. The point of this kind of design is for when reliability is important but nowhere near the top priority and cost savings at scale matter, which they almost always do in any large company. Once you get to enough physical servers attached to the SAN layer you start to see the ability to lower the cost of storage while making it "reliable enough" to make sense for the business at hand. So typically in an enterprise you might see hundreds or thousands of physical hosts in the "top" layer connected to many switches connected to a pair of big enterprise SANs (EMC VMAX for example.) This is never as reliable as not having the SANs at all, that just can't happen. But what it can be is quite a bit cheaper than not having the SANs, and while it is not the best reliability, it can be reliable enough that this is not a problem.

The key is that at large scale this design can be cheap. That's why at small scale only local storage makes sense because not only is it the most reliable and the fastest, at small scale it is always the cheapest too.
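The scale argument can be sketched as simple amortization. Every figure below is hypothetical and chosen only to show the shape of the curve, and the two helper functions are illustrative names, not a real pricing model: local storage is treated as a flat cost per host, while the SAN pair is a fixed cost spread across however many hosts attach to it.

```python
def per_host_cost_local(local_storage: float) -> float:
    """Each host carries its own storage, so the cost per host is flat."""
    return local_storage

def per_host_cost_shared(san_pair: float, per_host_attach: float, hosts: int) -> float:
    """The SAN pair is a fixed cost spread across every attached host."""
    if hosts < 1:
        raise ValueError("hosts must be at least 1")
    return san_pair / hosts + per_host_attach

# Hypothetical prices: local storage per host, the SAN pair, per-host attachment.
LOCAL, SAN_PAIR, ATTACH = 5_000.0, 60_000.0, 1_000.0

for hosts in (3, 10, 20, 200):
    shared = per_host_cost_shared(SAN_PAIR, ATTACH, hosts)
    cheaper = "shared" if shared < per_host_cost_local(LOCAL) else "local"
    print(f"{hosts:>4} hosts: shared {shared:>9,.0f} vs local {LOCAL:>7,.0f} -> {cheaper}")
```

With these made-up prices the shared design only wins once the fixed SAN cost is divided across enough hosts. At a handful of hosts, local storage is cheaper as well as simpler.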

scottalanmiller

@Carnival-Boy said:

Yeah, but cost isn't an issue as money is no object.

While I don't agree that this is ever true, even if cost is no object, SANs would never make sense since their only value is cost savings at large scale. If cost was never the goal or considered at all but only reliability and speed, that would drive us to bigger, better local storage only.

scottalanmiller

@DustinB3403 said:

What this means is that there are so many potential points for failure, and that in the most basic approach of the 3-2-1 the "reliability" isn't at all reliable, or is only as reliable as your weakest link, which is often the NAS (or SAN).

A better way to word and understand that is that in a dependency chain, which is what the dashes represent, you are always less reliable than your weakest link. It's not just that the SAN represents a weak point in the design, which certainly it does, you also have three failure domains. Two of them are much more reliable than the SAN, but they do present risk on their own and can fail. So your risk is not only the risk of the weakest point failing but of the combined risk of each of the layers.

Think of it this way: you have to roll a die three times (once for each domain.) If you don't get the number that you need, you lose your data. Ready.... go...

On the first roll, the SAN roll, you have to get a 4, 5 or 6. Basically you have a 50% chance of failure.

On the second roll and the third roll, you can get a 2, 3, 4, 5 or 6. You are still rolling and taking risk, but the risk of each roll is much less.

Just because a layer is very, very reliable doesn't mean there isn't risk in it and the risk of the layer is cumulative. So that is why adding layers, even when they are really reliable ones, introduces a negative value in regards to risk and why you only add them when there is a clear reason to do so (cost savings or whatever.)
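The dice analogy can be worked through exactly. This is a small sketch that multiplies the survival odds of each roll. The 3-in-6 and 5-in-6 odds come straight from the analogy above; labelling the second and third rolls as "switches" and "servers" is an assumption, since the analogy does not say which is which.

```python
from fractions import Fraction
from math import prod

# Chance of surviving each roll, taken from the dice analogy.
rolls = {
    "SAN": Fraction(3, 6),       # needs a 4, 5 or 6
    "switches": Fraction(5, 6),  # needs a 2 through 6 (label assumed)
    "servers": Fraction(5, 6),   # needs a 2 through 6 (label assumed)
}

def chain_survival(layers: dict) -> Fraction:
    """Every layer in a dependency chain must survive, so the odds multiply."""
    return prod(layers.values(), start=Fraction(1))

weakest = min(rolls.values())
overall = chain_survival(rolls)

print(f"weakest link survives: {weakest} ({float(weakest):.1%})")
print(f"whole chain survives:  {overall} ({float(overall):.1%})")
```

The chain comes out at 25/72, noticeably worse than the 1/2 of the SAN roll alone, which is the sense in which you are always less reliable than your weakest link.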

scottalanmiller

@Carnival-Boy said:

But isn't this 2-2-2 and not 3-2-1? I'm still not getting it.....

I'm playing catch-up here, so I'm only now seeing that you nailed this. Yes, it is a 2-2-2 or "column" design, far better than an IPOD. But it is not nearly as good as a 2. He has six pieces of gear to fail in three failure domains. What he has is far better (in terms of reliability) than a single server if done well, but not nearly as good as two servers without the external storage. The external storage and the need for external networking between the storage and the servers triples the failure domains without adding anything of value. In fact, beyond the risk, it only introduces cost, complexity and latency. No upsides, lots of downsides.
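A rough way to compare these designs is to treat each number as a layer of redundant devices and chain the layers together. The sketch below assumes that every device has the same, independent chance of failing and that a layer is only lost when all of its devices fail. Both are simplifications (a SAN is usually the weakest piece, and paired controllers can take each other down). The value of `q` and the function names are hypothetical.

```python
def layer_failure(device_failure: float, devices: int) -> float:
    """A redundant layer fails only if all of its devices fail."""
    return device_failure ** devices

def design_failure(device_failure: float, layers: list[int]) -> float:
    """The design fails if any one layer in the chain fails."""
    survival = 1.0
    for devices in layers:
        survival *= 1.0 - layer_failure(device_failure, devices)
    return 1.0 - survival

q = 0.10  # hypothetical chance that any one device fails

# Device counts per layer, written servers-switches-storage.
designs = {
    "single server": [1],
    "2": [2],
    "2-2-2": [2, 2, 2],
    "3-2-1": [3, 2, 1],
}

for name, layers in designs.items():
    print(f"{name:>13}: {design_failure(q, layers):.4f}")
```

Under these assumptions the 2-2-2 fails roughly three times as often as the plain 2, while still beating a single server, and the 3-2-1 ends up slightly worse than a single server because everything hangs off the lone storage device.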

scottalanmiller

@Carnival-Boy said:

Yes, and it seems to me that two hosts, two switches and two SANs (2-2-2) offers a decent level of redundancy without over-complicating the system. That's where I'm not getting where the "doom" is coming from.

Any unnecessary complication is over-complicated. You don't design a system to be "less than ideal" for no reason. The idea that the system isn't "terrible" is correct as long as you don't take into consideration the alternatives. He could have a system that is easier, safer, faster, simpler and cheaper. Why sacrifice all of those things just because something worse in all those ways is still "good enough?" You don't. You go for the clear win. His design, while "good enough" for nearly any scenario, only looks that way if you are dealing with raw numbers rather than the relative ones provided by other approaches.

To think of it another way - if you go to buy a car and you are going to buy a Ford Focus, would you be fine paying $80K for it? If you had no idea what cars cost and never compared other options, sure, a car is a miracle of engineering and over the life of a car you probably get more than $80K of value out of it. So you would spend that money. BUT what if you knew that you could buy that car for $20K? Would you still be happy and recommend that someone spend $80K on it when you know that the market value is only $20K and you can go anywhere and buy it for that?

In one case the raw value to you of "a car" might be $80K. But that doesn't mean that you should pay that much when you have the option of getting the car that you need for far less. The $80K would be good enough if better options were not readily available. But given that they are, it's not a good decision.

In the case of IT, imagine that IT is like a car buying consultant. Would you be happy if your car buying consultant sold you an $80K Ford Focus because he knew you could afford it and that it was worth that much to you to have a car? Of course not, you'd say that he wasn't doing his job and looking for the best value. That's what is happening here. The IT guy's job is to not just know how to spend money but how to get good IT value without wasting money. But in this situation, the IT guy is delivering a system worth less but spending tons more on it.

DustinB3403

Another one for the pile, not nearly as bad as some others, but in this case the hypervisor was set up on a single spinning rust drive, which, when it failed, killed the entire hypervisor host.

In this case, though, they have a XenPool, so recovery should be simple enough: install a new drive, reconfigure the host, and then rejoin it to the pool.

But had they configured the host to boot from a USB drive, they could simply install the backup USB drive and be up and running in a matter of moments.

DustinB3403

To boot, on the above system the designer left the battery backup off of the iSCSI storage unit which houses all of their VMs.

"...........ugh..."

DustinB3403

Not quite sure where this post belongs, so I'm placing it here.

An IT person is looking to set up a hypervisor with Server 2011 SBS and is looking for advice, but doesn't appear to be looking at actual business needs and weighing the options.

Just going off of some information he heard from somewhere.

MattSpeller

This is a good thread & I like it, but I would be curious to see its opposite as well - Burned by BP

scottalanmiller

@MattSpeller said:

This is a good thread & I like it, but I would be curious to see its opposite as well - Burned by BP

Ever seen that happen? If it is even possible, then it can't be a BP.

MattSpeller

@scottalanmiller if doing windows updates is best practice, that's an easy one

I was thinking server configs that conflict or other goodies like that

scottalanmiller

@MattSpeller said:

@scottalanmiller if doing windows updates is best practice, that's an easy one

Don't confuse burned BY a best practice with being burned IN SPITE of a best practice. Not the same thing.

MattSpeller

@scottalanmiller I don't really see much distinction in that difference but I get the gist of what you mean

scottalanmiller

@MattSpeller said:

@scottalanmiller I don't really see much distinction in that difference but I get the gist of what you mean

It is a HUGE distinction. It's like wearing your seatbelt. You can still get killed but it doesn't tell you that wearing a seatbelt was a bad idea. It still was the safer bet, even if it doesn't always save you.

scottalanmiller

Nothing protects you 100% of the time. There is always risk. Best Practices only exist when they reduce the risk or universally make sense. They can only reduce risk, though, not make it go away. So if it is truly a correct Best Practice then, sure, it is wise to understand that it does not protect you 100%, but it is misleading to think of that as being burned BY the best practice, which would lead people to say that they are avoiding BPs because of the risks that they present. If anything hints at such behaviour, it is the wrong thinking.

scottalanmiller

Hopefully we managed to head this guy off at the pass. http://community.spiceworks.com/topic/1236071-postgresql-on-ad

Installing PostgreSQL on already overloaded Active Directory servers! Linux VMs are not just better for this, they are free!!

DustinB3403

The topic is all kinds of backwards. He offers nothing about what his goal is, just what he wants to do. He has no concept of how everything should work, and is just extremely ill-informed.

Does he even have a hypervisor to run VMs on? I'm guessing not and that he's wanting to "make do" with what he has at the company's disposal.

Which doesn't seem like much.

Jason

What? RDS/Terminal Server on a DC... Worst thing you could do.