Most Customers Do Not Experience Failure

  • Far too often I hear a vendor, or worse, someone actually working in IT, use the "it worked for me," "tons of customers are happy," or "most people never experience a problem" sales tactic to push or justify a solution. This is either a thinly veiled attempt at deception or an indication that the vendor has no concept of risk. Either situation is scary.

    By definition, of course, we want to keep risk small. And in nearly anything that we undertake in the real world, risk is pretty small. The chance of slipping and falling to the point of injury in a bathroom in the US is very roughly 0.1% each year. That's a high risk: the bathroom is the riskiest place in the house, even riskier than the stairs. This risk is so high that we need to take special precautions when in the bathroom and often do, applying non-slip grips to bathtubs, adding handles to the walls, and so forth. But even with this high level of risk, 99.9% (three nines) of people in the US will not sustain a bathroom-related injury this year.

    Risk does not imply that something bad is going to happen, nor that it is more likely to happen than not. A risky situation simply means that the likelihood of something bad happening is higher than we are comfortable with.

    Let's take Russian roulette. Assume that you have six players, six chambers, and one bullet. Each player takes a turn, and after their turn the gun is passed to the next player. Each player's risk is 1 in 6, or a 16.7% chance of failure. This means that 83.3% of players will not lose (or will win, depending on your view). The vast majority of Russian roulette players will not die in any given game. Yet we still consider the risk to be insanely high.
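
    The 1 in 6 figure can be checked with a few lines of Python. This is only a sketch, and it assumes details the example above does not spell out: the cylinder is spun once at the start, never re-spun, and the game continues until the bullet fires. The function name loss_probability is made up for illustration.

```python
from fractions import Fraction

def loss_probability(position, chambers=6):
    """Chance that the player at `position` (1-based) fires the single bullet.

    Assumes one spin at the start and no re-spin between turns.
    """
    survived_so_far = Fraction(1)
    for shot in range(1, position):
        remaining = chambers - (shot - 1)
        # Every earlier player must have hit an empty chamber.
        survived_so_far *= Fraction(remaining - 1, remaining)
    remaining = chambers - (position - 1)
    return survived_so_far * Fraction(1, remaining)

per_player = [loss_probability(p) for p in range(1, 7)]
for position, risk in enumerate(per_player, start=1):
    print(f"player {position}: {risk} ({float(risk):.1%})")
```

    Under those assumptions every seat carries the same 1 in 6 risk, so turn order offers no protection, and five of the six players walk away able to say that it worked for them.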

    Vendors, however, will often cite that "most" customers (which can mean as few as 51%) do not experience catastrophic data loss, for example, as if "most" were somehow an acceptable standard. In RAID recovery, for example, we normally consider anything less than three nines of reliability during a resilver operation to be unacceptably low, with four nines being more of a standard goal. At three nines, when a failed disk has been replaced, failure to recover should happen in fewer than one in every one thousand array recoveries.
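
    To put "most" on the same scale as the nines used above, here is a rough Python sketch. The helper names failure_rate and nines_of are invented for this illustration, and "nines" is taken to mean the negative base-10 logarithm of the failure fraction, which is one common way to express it.

```python
import math
from fractions import Fraction

def failure_rate(nines):
    """Failure fraction implied by a whole number of nines (3 -> 1/1000)."""
    return Fraction(1, 10 ** nines)

def nines_of(success_rate):
    """Approximate count of nines in a success rate (0.999 -> about 3)."""
    return -math.log10(1 - success_rate)

for n in (3, 4):
    per_thousand = float(failure_rate(n) * 1000)
    print(f"{n} nines: {per_thousand} failed recoveries per 1,000")

# "Most" customers, taken at its minimum of 51%.
most = nines_of(0.51)
print(f"'most' at 51%: about {most:.2f} nines")
```

    Measured this way, a bare "most" works out to roughly a third of a single nine, nowhere near the three or four nines expected of a routine RAID recovery.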

    SAN is a place where we see this often (and we saw it this week). With anecdotal evidence suggesting that SAN-induced outages or data loss may impact nearly 50% of all customers, the vendor does not respond with reasonable risk numbers but simply states that "most" customers (51%) have not reported those issues. That's a staggering statement when we really think about it. When buying a high-cost, fully "redundant" SAN, most customers believe that they are buying reliability in the five- and six-nines range, but the vendors may be selling them figures so low that by any normal standard we'd simply call them "non-working."

    A 49% failure rate is a far, far cry from 0.0001%. Imagine if 49% of seatbelts did nothing! Or if 49% of brakes just didn't brake. Or if 49% of RAID recoveries failed. Or if 49% of trips to the shower resulted in a hospital trip.

    49% failure rates are crazy. Even 1% failure rates are crazy in nearly all cases.

    Are your vendors pushing you to play Russian roulette with your data?

