Risk: Single Server versus the Smallest Inverted Pyramid Design

scottalanmiller

This comes up so often that it is worth having a risk analysis for this one scenario.

Scenario: Two servers and a single NAS or SAN as shared storage. High Availability solution applied to the servers so that if one server dies the other can immediately spin up the failed VM. (This is a 2+1 Inverted Pyramid design.)

Assumptions: We assume the following...

Server means an enterprise class commodity server, a standard server like the HPE DL380 or Dell R730. Anything in that general category of reliability.
NAS or SAN is a comparable price range unit which would include unified storage offerings from companies like Synology or ReadyNAS as well as dedicated SAN from Dell and HPE.
The two plus one IPOD design is being considered because the "high availability" checkmark was required.
That the baseline for standard availability is defined as the level of availability associated with a standard enterprise server.

Now in this example we can represent the standard risk presented by a single server solution as X. That would be if we had a single server with no second server and no external storage. This is our baseline and is "standard availability." There is only a single failure domain to consider so this is very simple.

Now let's determine the risk of the 2+1 IPOD solution. We have two failure domains, the server tier and the storage tier. We have three devices, each is roughly equivalent with a risk of roughly X. (See below.)

So our storage tier has a risk of X, that's simple. There is no mitigation for the risk there, it is what it is.

The server tier has two servers that each have a risk of X, but we mitigate this with hypervisor or application layer high availability technologies. These are not perfect but we assume that they are effective. There are many possible risks here: that the HA layer will fail, that the workloads will not be consistent, that the applications will behave badly and, the big one, that the second host will fail while the first one is down. Even with all of these factors together the risk at this layer is a tiny fraction of X. We will call this risk Y, where Y is a positive risk number greater than zero but far closer to zero than to X. What is important is that Y is less than X but not zero.

Now the risk of the two failure domains must be combined because each tier must be fully functioning or the entire system has failed. If the server tier fails the storage is useless. If the storage tier fails the servers are useless. So we have a dependency chain of risks.

So the risk here is X + Y. We don't know what Y is, but what is important is that the risk of the resulting system is a number greater than X. It doesn't matter how risky X is, it doesn't matter how small Y is, the resulting risk figure is riskier than X by a tiny or potentially somewhat large amount.
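As a rough sketch of the math above: if we treat X and Y as independent probabilities of failure over some period (an assumption, the post only uses them as abstract risk figures), the risk of a chain where every tier must work is 1 - (1 - X)(1 - Y), which is X + Y - XY. For small values that is almost exactly X + Y, and it is always greater than X as long as Y is above zero. The numbers and the `series_risk` helper below are made up purely for illustration.

```python
def series_risk(*tier_risks: float) -> float:
    """Chance that at least one tier in a dependency chain fails.

    Assumes the tiers fail independently of each other.
    """
    survive = 1.0
    for risk in tier_risks:
        if not 0.0 <= risk <= 1.0:
            raise ValueError("risk must be a probability between 0 and 1")
        survive *= 1.0 - risk
    return 1.0 - survive


# Hypothetical figures, chosen only to illustrate the argument.
X = 0.02   # assumed risk of one standard server (or storage unit)
Y = 0.001  # assumed residual risk of the HA server tier, 0 < Y < X

single_server = series_risk(X)
ipod = series_risk(X, Y)  # storage tier (X) chained with server tier (Y)

print(f"single server: {single_server:.5f}")
print(f"2+1 IPOD:      {ipod:.5f}")
print(f"X + Y approx:  {X + Y:.5f}")
```

Whatever values are plugged in for X and Y, the chained figure never drops below X.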

If risk were our only factor, this would be not that far from break even. The single server design would still win on being less risky, easier to manage, more performant and a host of other reasons. But one factor that we are never without is cost. Cost is a form of risk itself that can never be ignored. If cost is no object then by extension risk is no object either (risk is measured in financial terms, after all).

In our example, one system has a single server. The other has three devices: two servers and a storage unit. It is plausible that the storage unit would cost less than one of the server nodes, surely, but far more likely it would cost more. Even if its cost were zero, the theory would remain strongly true, but for the sake of the example we will assume the average and say that it costs the same as one of the servers. The cost of the inverted pyramid design then comes in at an average of 300% of the cost of the single server solution while being more risky. This is huge. In the real world this would vary from something like 250% on the low end to something like 600% on the high end.
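The cost arithmetic can be sketched the same way. This assumes the only costs are the boxes themselves (no HA licensing, switches or labor, which would only push the figure higher) and expresses the storage unit as a multiple of one server's price. The `ipod_cost_ratio` name and the sample multiples are illustrative, not figures from any vendor.

```python
def ipod_cost_ratio(storage_vs_server: float, servers: int = 2) -> float:
    """Cost of a 2+1 design relative to one standalone server.

    storage_vs_server is the storage unit's price as a multiple of
    one server's price (1.0 means it costs the same as a server).
    """
    if storage_vs_server < 0:
        raise ValueError("storage cost multiple cannot be negative")
    return servers + storage_vs_server


# Storage free, half a server, same as a server, four times a server.
for multiple in (0.0, 0.5, 1.0, 4.0):
    ratio = ipod_cost_ratio(multiple)
    print(f"storage at {multiple:.1f}x a server -> {ratio:.0%} of single server cost")
```

Under these assumptions the 250% low end corresponds to storage costing about half a server and the 600% high end to storage costing about four servers.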

So at the end of the day, we spend 3x what we should just to have more work to maintain the system and to take on extra risk without benefit.

Many people like to think or claim, and vendors will happily feed into this belief, that storage devices, especially SANs, are "magic" and not subject to the same risks as normal servers, even though they are just servers themselves. A typical unified storage device is built on SuperMicro servers or similar, which makes its risk profile nearly identical to that of a standard enterprise server. This is, indeed, a standard server in every way.

It is very common to look to dedicated SANs that are slightly more expensive as also being infallible, sometimes simply because of the name SAN (which is only a reference to its network protocols) and sometimes because of the assumption that dual controllers will break the rules of risk and make the single device effectively riskless. Dual controllers are not used in standard enterprise servers for a reason - they generally add no value and often create additional risk. Indeed, with non-active/active controllers (the only ones that can be remotely considered in this price range and scale), dual controllers are shown to routinely create disasters far more often than standard servers fail.

Most storage devices in this range also lack the support options that enterprise servers have. This is not a problem built into the solution type but a result of common choices in the approach to this layer, so it should not generally be calculated as a rule. But it is very common to see people say that they must avoid five minutes of downtime at the server layer but then select a storage device with a two week SLA on repairs - and one that will not be returned with the data intact!

Because of the combination of lower production rates, less testing, dual controller induced failure, special case software and more, the storage layer is actually normally quite a bit more risky than the server layer even with comparable hardware. So we are actually overly generous to the IPOD solution approach by calling this risk X, when in fact it is generally quite a bit higher, possibly 2X or more!
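To see how much the "generous" assumption matters, here is a small sketch that scales the storage tier's risk by a multiplier and reports the whole design's risk relative to a single server. It treats the risks as independent probabilities, and both the sample numbers and the `ipod_relative_risk` helper are hypothetical.

```python
def ipod_relative_risk(x: float, y: float, storage_multiplier: float) -> float:
    """Risk of the 2+1 design divided by the risk of one server (x).

    x: assumed risk of a standard server
    y: assumed residual risk of the HA-protected server tier
    storage_multiplier: how many times riskier than x the storage unit is
    """
    storage = storage_multiplier * x
    if not (0.0 < x <= 1.0 and 0.0 <= y <= 1.0 and 0.0 <= storage <= 1.0):
        raise ValueError("risks must be probabilities, and x must be above zero")
    combined = 1.0 - (1.0 - storage) * (1.0 - y)
    return combined / x


# Illustrative values only: x = 2%, y = 0.1%.
for multiplier in (1.0, 1.5, 2.0):
    rel = ipod_relative_risk(0.02, 0.001, multiplier)
    print(f"storage at {multiplier:.1f}X -> {rel:.2f} times the single server risk")
```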

It should be noted that it is possible to move up to very high end, very expensive storage devices that will mitigate a large portion, but never all, of this risk. But in a 2+1 design these would normally double or more the entire product cost, and they are effectively unthinkable as far better risk mitigation strategies are available that are both less risky and far less costly.

dafyre

@scottalanmiller said:

Assumptions: We assume the following...
NAS or SAN is a comparable price range unit which would include unified storage offerings from companies like Synology or ReadyNAS as well as dedicated SAN from Dell and HPE.
The logic in considering the two plus one IPOD design is because the "high availability" checkmark was required.
That the baseline for standard availability is defined as the level of availability associated with a standard enterprise server.

Question about the assumptions... Does this assume a single SAN node, and not at least two nodes in an active/active (or active/passive) cluster?

scottalanmiller

@dafyre said:

@scottalanmiller said:

Assumptions: We assume the following...
NAS or SAN is a comparable price range unit which would include unified storage offerings from companies like Synology or ReadyNAS as well as dedicated SAN from Dell and HPE.
The logic in considering the two plus one IPOD design is because the "high availability" checkmark was required.
That the baseline for standard availability is defined as the level of availability associated with a standard enterprise server.

Question about the assumptions... Does this assume a single SAN node, and not at least two nodes in an active/active (or active/passive) cluster?

Yes, if you had two SANs that would not be an inverted pyramid, it would be a tall column.

ntoxicator

@scottalanmiller

Thank you for this post.. I have a hunch as to why you posted this.

scottalanmiller

@ntoxicator said:

@scottalanmiller

Thank you for this post.. I have a hunch as to why you posted this.

Wasn't you. This was proposed several times last week.

travisdh1

@scottalanmiller said:

The singer server

Typo spotted, 10th paragraph. That's supposed to be single instead of singer? (Unless maybe a fan is going bad :P)

scottalanmiller

thanks, fixed.