Why Dual Controllers Are Not a Risk Mitigation Strategy on Their Own
-
It has become a mantra of storage salespeople to say that storage devices with dual controllers are highly available and, more or less, magic black boxes that won't fail. Nothing, of course, could be further from the truth. Common sense as well as observation say that not only do they fail, they fail more often than even a server of a comparable level would. Why? Let's investigate.
First of all, let's set the stage. The idea with most storage in these discussions is that it is used to back VMs or physical servers, so it is an additional point of failure and must make up for that by being insanely reliable on its own. SANs come in a ridiculous variety, from the silly Netgear SC101 on the low end to things like EMC VMAX on the high end, exactly the same way that computers range from a Raspberry Pi Zero to the IBM z series mainframes. While both of the former are SANs and both of the latter are computers, they are very, very different animals.
There are many facets to this discussion in general, but here we will look at the dual controller issues specifically. First, the quick basics...
- Normal servers have dual "everything" if you want them to.
- Servers at the same "tier" as storage have the same redundancy, including controllers.
- Servers sell at high volume and get more testing than their storage counterparts.
- Having two of something and having them be high availability are not the same thing.
So let's break this down.
"Dual controllers" just means that there are two of them. It does not mean that one takes over when the other one fails. It's that simple. The assumption is that failover is why two controllers are present, but that is a marketing gimmick. There are two controllers because it raises the price and makes it easy to get people to buy, not because it provides a demonstrable value.
High end, mainframe class SANs like those available from EMC, HDS and HPE 3PAR offer what we call active/active controllers or "highly decoupled" controllers. The controllers are independent, do not share firmware, are both seeing the disks all of the time and essentially have no dependencies on each other. Because high end SANs offer this, it is easy to lump all dual controller SANs into this category, but that is no more how things work than your Raspberry Pi has dual failover motherboards like your HPE Integrity does. In a high end, mainframe class SAN like EMC VMAX you indeed get high availability in a single SAN chassis. You also pay for it.
It cannot be missed that if you buy an EMC VMAX, as an example, you could have purchased one of several similarly classed servers from companies like IBM, HPE and Oracle that offer the same high reliability in a single chassis. So the dual controllers of the SAN in this range still do not increase the reliability relative to servers of the same class. Even here, the dual controller system only maintains reliability at its tier. SANs are never more reliable than their server compatriots at a given tier.
When we drop to the next tier down of storage, most SANs continue to come with dual controllers in a single chassis. However, these are not active/active controllers. They are instead systems like active/passive, and they are highly coupled. The tight coupling is where we start to see issues. It means that the controllers have some amount of dependency upon each other - often rather a lot of it. This is often in the form of firmware dependencies but can be electrical or mechanical as well. These systems often perform flawlessly in non-failure modes, such as in demos where one controller is removed from the chassis, so they look great in showrooms. But in the real world, when a controller fails, the overall system failure rate is extremely high. And since there are twice as many controllers to fail as necessary, the chance of there being "a" failure is much higher than if there were only a single controller.
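To put rough numbers on that last point, here is a minimal Python sketch. The figures are purely hypothetical and are not measurements from any product: `p` is an assumed chance that a single controller fails in some period, and `q` is an assumed chance that a controller failure in a tightly coupled pair takes the whole system down rather than failing over cleanly. It also assumes the two controllers fail independently, which is generous to a tightly coupled design.

```python
def p_any_controller_fails(p: float, n: int = 2) -> float:
    """Chance that at least one of n independent controllers fails."""
    return 1.0 - (1.0 - p) ** n


def p_outage_single(p: float) -> float:
    """Single controller: any controller failure is an outage."""
    return p


def p_outage_coupled_pair(p: float, q: float) -> float:
    """Coupled pair: an outage needs a controller failure that is NOT
    survived. q is the assumed chance that a failure is not survived."""
    return p_any_controller_fails(p, 2) * q


if __name__ == "__main__":
    p = 0.05  # hypothetical per-controller failure chance
    for q in (0.1, 0.6):  # hypothetical "failover did not save us" rates
        print(
            f"q={q}: single={p_outage_single(p):.4f} "
            f"coupled pair={p_outage_coupled_pair(p, q):.4f}"
        )
```

Under these assumptions the pair is worse than a single controller whenever `q` exceeds `1 / (2 - p)`, which is roughly one half for small `p`. In other words, two controllers only help if failover actually works most of the time a controller dies.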
Tightly coupled dual controller SANs are probably the riskiest form of storage device. Because the controllers tend to "shoot each other," and because they often have high failure rates on firmware patching, long observation across many product lines shows them to be dramatically less reliable than a normal server or even a cheaper single controller SAN device.
And, of course, there remain the risks of a shared chassis and backplane and a single disk array. Those are major components to share.
And there is "HA compatibility". It is not uncommon for a storage device with dual controllers to fail over from one controller to the other successfully, according to the storage device itself, but for the systems connected to it to fail because the failover happened too slowly. The speed of failover is often overlooked in these scenarios, and systems can fail due to the dual controllers while the vendor gets to report that no failure took place. A tricky reporting strategy.
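The gap between the two views can be sketched in a few lines of Python. Everything here is hypothetical and for illustration only: the `FailoverEvent` type, the function names, and the 45 second and 30 second figures are assumptions, not values from any real array or operating system.

```python
from dataclasses import dataclass


@dataclass
class FailoverEvent:
    completed: bool    # did the surviving controller eventually take over?
    duration_s: float  # how long I/O was stalled during the takeover


def array_reports_success(event: FailoverEvent) -> bool:
    """The storage device's view: the failover finished, so all is well."""
    return event.completed


def host_survives(event: FailoverEvent, io_timeout_s: float) -> bool:
    """The connected system's view: the failover only counts if it
    finished before the host gave up on its outstanding I/O."""
    return event.completed and event.duration_s < io_timeout_s


if __name__ == "__main__":
    event = FailoverEvent(completed=True, duration_s=45.0)  # hypothetical
    host_timeout_s = 30.0                                   # hypothetical
    print("array says OK:", array_reports_success(event))
    print("host survived:", host_survives(event, host_timeout_s))
```

With these made-up numbers the array logs a successful failover while the host has already errored out, which is exactly the kind of outage that never shows up in the vendor's statistics.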
Dual controllers sound wonderful, but as a concept on their own they are meaningless. How they are implemented and how they are intended to work matter greatly. By and large, dual controllers are a negative: they are costly and risky, and they exist only as a marketing strategy to give SANs a degree of plausible "magic", or to give IT buyers plausible deniability by letting them say they felt confident that the system could not fail. But the reality is that buying a non-active/active dual controller SAN is the riskiest storage move.