Burned by Eschewing Best Practices
-
@Carnival-Boy said:
Two SANs offer a high degree of redundancy. I'm not sure where 3-2-1 fits in with that. He doesn't have a SPOF, does he? He has redundant switches, redundant controllers, redundant SANs.
Sure there is complexity in there. But there's complexity with DAGs and file syncing as well.
Sure they do, but at what cost? Is the cost worth it when you could get the same functionality for tens of thousands of dollars less?
I've run into the "dual controllers are totally going to save the world" salesperson before, and one of my former managers bought into it. It turned out the controllers shared the same backplane and the same path to the drives, so when one controller locked up due to a firmware issue, the other controller wouldn't take over. Everything went offline for ~8 hours until they overnighted us a new controller and we updated the firmware on both of them.
-
Yeah, but cost isn't an issue as money is no object.
-
@Carnival-Boy said:
Yeah, but cost isn't an issue as money is no object.
Any business that says that is just trying to fail! The whole point of a for-profit company is to make money, and it should be doing so with smart spending.
-
I'm not disagreeing, but if the OP says money is no object then you should treat that as fact. Maybe he has a magic money tree. Or is forced to spend a certain budget regardless of whether he needs it or not. Who knows, that's not the point. The point I'm trying to understand is why dual SANs and dual switches equals this pyramid of doom thing.
-
It simply creates a larger pyramid with more parts, which makes the entire system far more complex to troubleshoot and fix should something happen.
It doesn't force the system to be less reliable than the standard 3-2-1 model, since you are in fact adding a level of redundancy by implementing a second SAN to back up the first.
But it's just wasteful in most cases.
-
@DustinB3403 said:
It simply creates a larger pyramid with more parts, which makes the entire system far more complex to troubleshoot and fix should something happen.
It doesn't force the system to be less reliable than the standard 3-2-1 model, since you are in fact adding a level of redundancy by implementing a second SAN to back up the first.
But it's just wasteful in most cases.
Basically this. You aren't any more reliable than the dual host scenario and you've introduced several more layers of potential failure to your system.
There is a point where this makes sense... but not at 6 servers and two physical hosts. I'm not sure where the tipping point is but probably at the hundreds of virtual servers mark.
-
@coliver said:
You aren't any more reliable than the dual host scenario and you've introduced several more layers of potential failure to your system.
You are actually mathematically substantially less reliable with that setup, at least from a hardware failure perspective.
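Here is a rough sketch of the arithmetic behind that, hardware only. It assumes every device fails independently, that any one surviving device in a layer can carry the full load, and that every device has the same availability. The 99% figure is made up purely for illustration, not a measured value, and the function names are just mine:

```python
# Illustrative only: the 0.99 figure and the independence assumption are
# invented for the sake of the arithmetic. Software failures are ignored.

def layer_availability(device_availability: float, count: int) -> float:
    """A redundant layer is up if at least one of its devices is up."""
    return 1 - (1 - device_availability) ** count


def stack_availability(device_availability: float, layer_counts: list[int]) -> float:
    """Layers sit in series: every layer must be up for the stack to be up."""
    result = 1.0
    for count in layer_counts:
        result *= layer_availability(device_availability, count)
    return result


a = 0.99  # hypothetical per-device availability
two_hosts = stack_availability(a, [2])            # two hosts with local storage
two_two_two = stack_availability(a, [2, 2, 2])    # 2 hosts, 2 switches, 2 SANs
three_two_one = stack_availability(a, [3, 2, 1])  # classic 3-2-1

print(f"two hosts: {two_hosts:.6f}")
print(f"2-2-2:     {two_two_two:.6f}")
print(f"3-2-1:     {three_two_one:.6f}")
```

Under those assumptions every extra layer multiplies in another factor below 1, so the 2-2-2 stack can't beat the two-host setup on hardware alone, and the 3-2-1 can't beat its single storage device. Real numbers will differ, and this says nothing about the software side.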
-
You got any facts to back that up? I find it extremely difficult to evaluate reliability. Anyway, you can't just judge it from a hardware failure perspective, since we're comparing hardware redundancy versus software redundancy (e.g. DAGs, file syncing). Both are complicated. Both require expertise to administer and both are risky.
-
@Carnival-Boy said:
I'm not disagreeing, but if the OP says money is no object then you should treat that as fact.
I don't agree. Knowing someone is wrong, confused or doesn't understand something is exactly when they need help most, not the least. Tons and tons of what we do in IT is recognizing when people don't know what they need to know and helping them. In a case like this where we know they have to be wrong and don't understand what they are doing, should we really help them hurt themselves?
I totally get that this goes against my "always give people the benefit of the doubt" theory about never hurting the innocent to protect the guilty, but money is never no object. It's simply not true, and it means someone desperately needs help and doesn't understand that they don't know.
-
@Carnival-Boy said:
Or is forced to spend a certain budget regardless of whether he needs it or not. Who knows, that's not the point.
That would actually make money the ONLY object. Budget would be the whole concern, not just part of it.
-
Well, ok, but the OP isn't actually on ML so it's a moot point. What I'm really interested in is what the problem is with his solution (ignoring the financial cost) and why it is one of your inverted pyramid thingies. I'm not arguing, I just don't understand and want to learn.
-
An Inverted Pyramid of Doom looks stable and reliable when you look at it from the top: you have a bunch of equipment that supposedly will fail over between the devices.
But what isn't obvious is that, if you look at it from the side, you have individual layers of equipment, each of which depends on everything above or below it.
So in the simplest example, 3-2-1, you have 1 NAS (or SAN), 2 switches and 3 servers.
The name refers to three (this is a soft point, it is often two or more) redundant virtualization host servers connected to two (or potentially more) redundant switches connected to a single storage device, normally a SAN (but DAS or NAS are valid here as well.) It’s an inverted pyramid because the part that matters, the virtualization hosts, depend completely on the network which, in turn, depends completely on the single SAN or alternative storage device. So everything rests on a single point of failure device and all of the protection and redundancy is built more and more on top of that fragile foundation. Unlike a proper pyramid with a wide, stable base and a point on top, this is built with all of the weakness at the bottom. (Often the ‘unicorn farts’ marketing model of “SANs are magic and can’t fail because of dual controllers” comes out here as people try to explain how this isn’t a single point of failure, but it is a single point of failure in every sense.)
What this means is that there are many potential points of failure, and that in the most basic 3-2-1 approach the "reliability" isn't reliable at all, or is only as reliable as your weakest link, which is often the NAS (or SAN).
If any part of that chain breaks, the whole system can and likely will come crashing down. Here's a really good explanation from the one and only SAM
-
In addition, and outside of what was brought up in the SW topic, there were likely many other Best Practices that the OP on SW did not follow, which led to him getting burned, regardless of what hypervisor his employer uses.
The reason I say this is that the SWOP has stated he was already burned by Citrix Support, which seems very odd, as support didn't design the system to fail; they are trying to recover a failed system.
In summation, the SWOP has a system that was improperly set up (likely by Eschewing Best Practices) for the benefit of quick deployment, and he doesn't understand how and why he got burned.
It has nothing to do with Citrix, unless Citrix saw the state of their system and how things were configured and said "Nope Nope Nope, we can't help you as everything you've set up completely ignores Best Practice Recommendations in its configuration, it has to be rebuilt." And "We won't support the system in this configuration."
Which is probably how the conversation went.
-
But isn't this 2-2-2 and not 3-2-1? I'm still not getting it.....
"The name refers to three (this is a soft point, it is often two or more) redundant virtualization host servers connected to two (or potentially more) redundant switches connected to a single storage device"
There is no single storage device here. Isn't it a "Tower of Redundancy" rather than a "Pyramid of Doom"? An expensive tower, but a tower. Or maybe a folly.
-
And what's the difference between "Inverted Pyramid of Doom" and the traditional term "Single Point of Failure (SPOF)", as in "a single SAN is a SPOF and therefore a bad solution. You need at least two for redundancy"?
-
This is still an Inverted Pyramid of Doom, because the servers are dependent on the NAS(s). It's an improved one (if such a thing could exist), but there are many points that can fail.
That makes it an overly complicated solution that, by design, reduces the recoverability, stability and reliability of the system as a whole.
-
A Single Point of Failure by itself won't bring the entire organization down.
Only that point, and whatever it hosts, is unavailable until it's fixed.
-
The best way to think of a SPOF is to take any single server and unplug it, with no other backup servers for its functions to migrate to.
That is a SPOF: a system or server that runs alone, hosting whatever it might be. When it's down, it and only it is down until the problem is repaired.
-
@DustinB3403 said:
That is a SPOF. A system or server, that runs alone, hosting whatever it might be. And when it's down, it and only it are down until the problem is repaired.
That's not my understanding of SPOF. In the context of the OP, the "system" contains various pieces of hardware (hosts, switches & SANs). If he lacks redundancy in one area of this system (for example, by only having one switch), then that piece of non-redundant hardware is a SPOF. In the pyramid analogy, it is the '1' in 3-2-1 that represents a non-redundant component and the '1' is the SPOF.
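To make that definition concrete, here is a minimal sketch of an "unplug one device" check. The layer names and counts are only labels I've picked for illustration, and it assumes any surviving device in a layer can take over for the one that was pulled:

```python
# Hypothetical model: a system is a set of layers, each with a device count.
# Assumption: a layer stays up as long as at least one device in it is up.

def system_is_up(layers: dict[str, int]) -> bool:
    """The system works only if every layer still has a working device."""
    return all(count >= 1 for count in layers.values())


def single_points_of_failure(layers: dict[str, int]) -> list[str]:
    """Unplug one device from each layer in turn and see what breaks."""
    spofs = []
    for name, count in layers.items():
        after_failure = dict(layers)
        after_failure[name] = count - 1  # pull a single device from this layer
        if not system_is_up(after_failure):
            spofs.append(name)
    return spofs


three_two_one = {"hosts": 3, "switches": 2, "storage": 1}
two_two_two = {"hosts": 2, "switches": 2, "storage": 2}

print(single_points_of_failure(three_two_one))
print(single_points_of_failure(two_two_two))
```

By this definition the only SPOF in a 3-2-1 is the storage layer, and a 2-2-2 has none. It only counts boxes, so it won't catch things like two controllers sharing one backplane.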
-
It still represents the same single point of failure. Any device (including a network switch, NAS, server, or network cable) that doesn't have a redundant "fail-safe" is a SPOF.