Inverted Pyramid of Doom - Explained



  • So let's say you are looking at purchasing a new server footprint to replace everything you have today.

    You might call your rep and say, "Hey, I need some servers," without providing any detail on how you want to configure or use them.

    They might, and often do, come back with a system design like the one below, because they know you have no clue what you need and that you're an IT buyer.

    Youtube Video

    0_1517346589262_IPOD-Defined.png

    In the picture above there is a blue triangle. The parts that matter here are the Storage Server, the Lab Switch, and Physical Hosts 1 and 2.

    If we break this image up into its component pieces we get this.

    0_1517346637830_individual-sections.png

    What your rep will tell you is that you can lose an entire server and still be operational. While this may be true, it is only situationally true, because you're looking at the design from the top down: all you see is the base of the pyramid (the two hosts), which sits on top because the pyramid is upside down.

    In the image above, Physical Host 1 has failed (indicated by the red X), and everything migrates over to Physical Host 2 (indicated by the arrow).

    • You can only lose one of the physical hosts, either 1 or 2, not both.
    • If you lose the storage server, your physical hosts are useless.
    • If you lose the switch, everything is inoperable.
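
    The three failure cases above can be sketched as a tiny model. This is only an illustration: the component names and the `ipod_is_up` function are made up for this example, and it assumes the workload needs at least one host, plus the switch, plus the storage server.

```python
# Hypothetical model of the design in the picture: two hosts, one switch,
# one storage server. Names are illustrative, not taken from any product.
HOSTS = {"host1", "host2"}


def ipod_is_up(failed):
    """Return True if the workload can still run with `failed` components down."""
    hosts_left = HOSTS - set(failed)
    # Need at least one surviving host AND the single switch AND the single storage box.
    return bool(hosts_left) and "switch" not in failed and "storage" not in failed


if __name__ == "__main__":
    for scenario in [{"host1"}, {"host2"}, {"host1", "host2"}, {"storage"}, {"switch"}]:
        state = "up" if ipod_is_up(scenario) else "DOWN"
        print(sorted(scenario), "->", state)
```

    Losing one host is survivable, but a single failure of the switch or the storage server takes everything down, no matter how many hosts sit on top.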

    The whole thing looks like an inverted pyramid (of doom). You must look at the system from the side to see this vulnerable design. Otherwise, as an uninformed buyer, you see something that looks reliable, but you don't see the weakness of the design.

    What you'll hear repeated over and over is that SANs are super reliable. Unfortunately, that often isn't the case with the systems being sold in this design. Each unit by itself offers standard availability (99.9% - wiki website explaining High Availability), but chaining them together lengthens the dependency chain and lowers the availability of the system as a whole.
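
    To put rough numbers on the dependency chain, here is a minimal sketch of the usual series/parallel availability arithmetic. It assumes every component is 99.9% available (the figure quoted above) and that failures are independent, which real hardware won't strictly honor; the function names are made up for this example.

```python
HOURS_PER_YEAR = 24 * 365


def series_availability(*parts):
    """All parts are required, so their availabilities multiply."""
    result = 1.0
    for a in parts:
        result *= a
    return result


def parallel_availability(*parts):
    """Any one part is enough, so the outage probabilities multiply."""
    all_down = 1.0
    for a in parts:
        all_down *= 1.0 - a
    return 1.0 - all_down


def downtime_hours_per_year(availability):
    return (1.0 - availability) * HOURS_PER_YEAR


EACH = 0.999  # assumed availability of every individual box

host_pair = parallel_availability(EACH, EACH)      # either host can carry the load
ipod = series_availability(host_pair, EACH, EACH)  # host pair + switch + storage server

if __name__ == "__main__":
    print(f"single server : {EACH:.4%}  ({downtime_hours_per_year(EACH):.2f} h/yr down)")
    print(f"full stack    : {ipod:.4%}  ({downtime_hours_per_year(ipod):.2f} h/yr down)")
```

    Under these assumptions the full stack lands at about 99.8%, roughly double the yearly downtime of a single 99.9% server, even though it has more hardware in it.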

    So how do you address this? You remove the SAN and the switch from the picture entirely and use on-host storage with hyperconvergence software.

    StarWind has one such solution, but more importantly to this conversation, they have a ton of documentation on it, explaining how to avoid getting put into a situation that you think is reliable and safe when in fact it is less reliable (dependency chain) and more expensive (more hardware to buy).

    Fault Tolerance and High Availability and Replicated Local Storage aka Virtual SAN



  • So if you wanted to get out of this setup, you can jump to the column design: literal duplicates of all the hardware required to run the systems.

    As shown here:
    0_1517347145190_column-design.png

    Or you go with a vSAN approach, remove the "Storage Server" and put the storage within your physical hosts 1 and 2.

    vSAN approaches generally scale on demand (really, they scale as you provide resources, but "scale on demand" sounds nicer).
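
    As a rough comparison of the two ways out, the sketch below looks for single points of failure in each layout. The component names, the two-column layout, and the two-host vSAN with fully replicated local storage are assumptions made for illustration; real products add details (such as witnesses or quorum) that this ignores.

```python
def column_is_up(failed):
    """Column design: two complete, independent stacks (assumed layout)."""
    columns = [
        {"host1", "switch1", "storage1"},
        {"host2", "switch2", "storage2"},
    ]
    # The service survives if at least one column has no failed component.
    return any(not (column & set(failed)) for column in columns)


def vsan_is_up(failed):
    """vSAN-style design: storage lives inside the hosts and is replicated (assumed)."""
    return bool({"host1", "host2"} - set(failed))


def single_points_of_failure(components, is_up):
    """List every component whose failure alone takes the service down."""
    return sorted(c for c in components if not is_up({c}))


COLUMN_PARTS = {"host1", "switch1", "storage1", "host2", "switch2", "storage2"}
VSAN_PARTS = {"host1", "host2"}

if __name__ == "__main__":
    print("column:", single_points_of_failure(COLUMN_PARTS, column_is_up))
    print("vSAN  :", single_points_of_failure(VSAN_PARTS, vsan_is_up))
```

    In this simplified model neither layout has a single point of failure. The difference is that the column design gets there with six boxes and the vSAN design with two.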

    So why would you want to use a column design? There are a few use cases: an unlimited budget, extreme uptime requirements, or vSAN options that have limited support compared to the hardware support that may be available.

    The choices are many. ESXi has a vSAN approach, I know Dell also sells a system (not cheap), and there's @scale with the KVM hypervisor and the complete toolstack that comes with it.



  • XenServer (and soon XCP-ng) has a vSAN type of approach as well.

    I'd ask @olivier to explain how it all works though.