Redundancy is Never a Goal, Reliability is a Goal, Redundancy is a Tool



  • @art_of_shred said:

    I'm going to throw a wrench in here sideways. Reliability is definitely key, but isn't redundancy meant to ensure continuity and not reliability?

    System level reliability would be more or less the same thing as continuity. If you can be reliably working, that means you have continuity.



  • In the case of the plane example - you need to have reliability in the pilots, and one of the ways to get that is by having a redundant pilot.



  • @scottalanmiller When you say "System level" -- I assume you mean across the entire system... be that one server or ten.

    Am I right?



  • When you ask about continuity you have to ask about uptime - what is your uptime requirement? Let's assume you're using a Unitrends appliance and you can afford 15 minutes of downtime. Perhaps instead of buying two VM hosts with replicated data (or shared SAN/NAS, etc.) you decide that you can spin up the image from the Unitrends box. That's not redundancy in my book, but it meets your reliability objective.
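
    As a rough sketch of that reasoning: start from the downtime you can tolerate and pick the cheapest option that meets it, whether or not it involves redundancy. The option names, recovery times and costs below are made-up placeholders for illustration, not vendor figures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Option:
    name: str
    recovery_minutes: float  # hypothetical time to restore service after a host failure
    cost: float              # hypothetical relative cost


def cheapest_meeting_target(options, max_downtime_minutes):
    """Return the lowest-cost option whose recovery time fits the downtime budget."""
    viable = [o for o in options if o.recovery_minutes <= max_downtime_minutes]
    if not viable:
        return None  # nothing meets the requirement; rethink the budget or the options
    return min(viable, key=lambda o: o.cost)


# Placeholder numbers only; plug in your own estimates.
options = [
    Option("two replicated VM hosts", recovery_minutes=1, cost=3.0),
    Option("spin up image on backup appliance", recovery_minutes=10, cost=1.0),
    Option("restore to new hardware", recovery_minutes=480, cost=0.5),
]

choice = cheapest_meeting_target(options, max_downtime_minutes=15)
print(choice.name)
```

    With a 15 minute budget and these placeholder figures, the backup appliance wins even though nothing in that design is redundant. Tighten the budget and the answer changes, which is the point: the requirement drives the design.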



  • @art_of_shred said:

    But, if one should suddenly have a heart attack, I really want to know that there is redundancy (co-pilot) to ensure that the plane stays in the sky.

    Is that really what you care about? I don't. I care that everyone survives unhurt. I don't care how it happens. If it is handled through having a co-pilot, that's fine. If it is handled by having the plane land itself, that's fine. If it is a magical hand that goes up in the sky and pulls the plane to safety, that's fine. Sure, a co-pilot is probably the easiest way to address this given current technology, but it's not a goal, it's a tool. If a better way exists, I expect them to use that. It's the reliability of the flight system (measured in reliably safe flights) that matters to me. If they achieve that with pilots, monkeys, computers, hamsters, magic... I don't care.



  • @dafyre said:

    @scottalanmiller When you say "System level" -- I assume you mean across the entire system... be that one server or ten.

    Am I right?

    Depends on the level that matters. The simplest way that most people think of it is "delivering a service to the end users" or "continuation of functionality". So if users need email, it would be the ability to deliver email services to customers that would be measured. Of course at the business level it's "ability to communicate", not "gets email", but the interface between the business' theoretical needs and the concrete implementation has to happen somewhere so that it is practical, defined and measurable. So once email is the agreed-upon service, you would measure at the "availability of email" level to know what reliability the IT department is delivering.

    Whether servers fail, storage fails, ISPs go down, buildings burn, etc. doesn't matter to the system as long as email continues to work, however that happens. Basically look at the ends, not the means.
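
    A minimal sketch of measuring at the "availability of email" level, assuming you have some end-to-end probe (say, a periodic test message) whose pass/fail results are logged. The probe log here is invented sample data; how the probe itself works is left out.

```python
def availability(probe_results):
    """Fraction of probes in which the service worked from the user's point of view."""
    if not probe_results:
        raise ValueError("no probes recorded")
    return sum(1 for ok in probe_results if ok) / len(probe_results)


# Hypothetical probe log: True means a test email was delivered end to end.
# Which server, disk or ISP was involved is deliberately not recorded here.
probes = [True] * 9990 + [False] * 10

print(f"email availability: {availability(probes):.2%}")
```

    Nothing in the measurement knows or cares what failed underneath. It only records whether the agreed-upon service worked.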



  • @art_of_shred said:

    ....but I'm still not entirely comfy with thinking that a really good pilot negates, or even reduces, the need for a co-pilot.

    The reason that you feel this way, and you are correct, is because you are already looking "under the hood" of the system and looking at "how" to make a system reliable yourself rather than looking at the reliability of the system itself.

    Let's take this into IT. A pilot is like a hard drive and an airplane is like a server. Would we implement a server without a minimum of RAID 1 redundancy to protect our storage? Nope, of course not. However, that's under the hood and something that we know is a practical, everyday means of accomplishing reliability for this particular scenario - it's an implementation pattern. That's fine. That redundancy is never a goal does not suggest that it is not the most common tool to achieve the goal, just that it is really important not to mistake it for the goal. Just like two pilots is the most common pattern for achieving airplane reliability. There are other ways to do both of these things, but these two are such well established, effective patterns that it is insanely uncommon to deviate from them. But over time, new technologies or ideas might come forth that make these obsolete, and other approaches might be a safer, cheaper or easier way to get better reliability. At that point we should instantly change how we do things, because double hard drives or double pilots isn't a goal. It is simply a means to an end, with the end being reliability or continuity or safety, however you want to look at that.



  • @Dashrender said:

    In the case of the plane example - you need to have reliability in the pilots, and one of the ways to get that is by having a redundant pilot.

    You can think of it as pilot level reliability, but that is based around the assumption that pilots are necessary. It's handy for engineers to determine that pilots are necessary for plane level reliability and then focus on the pilot risks, but overall it is plane reliability, not pilot reliability, that the people riding in the plane, the investors in the airline, and the government agencies auditing flight data care about. They want people getting up and down safely. If that can happen without pilots, great. It can't today, but employing pilots isn't the goal, it's just the best means to that goal currently.



  • @scottalanmiller said:

    @Dashrender said:

    In the case of the plane example - you need to have reliability in the pilots, and one of the ways to get that is by having a redundant pilot.

    You can think of it as pilot level reliability, but that is based around the assumption that pilots are necessary. It's handy for engineers to determine that pilots are necessary for plane level reliability and then focus on the pilot risks, but overall it is plane reliability, not pilot reliability, that the people riding in the plane, the investors in the airline, and the government agencies auditing flight data care about. They want people getting up and down safely. If that can happen without pilots, great. It can't today, but employing pilots isn't the goal, it's just the best means to that goal currently.

    Yeah, the pilots are like your hard drive example.



  • Exactly. We all know that using hard drives for storage is practical and having them in RAID is the only practical way to make them safe in normal servers. Simple, proven, effective. But what if someone invented storage that was more reliable than hard drives in RAID 1 without needing redundancy of drives? Would we still use RAID 1 with hard drives? Of course not, redundancy isn't the goal, protecting the data and maintaining uptime are.

    Or what if new techniques come out, like RAIN, that work for our servers (say we have a bigger cluster, like a two-node Hyper-V cluster with StarWind or a Scale cluster)? Then suddenly RAID might not make sense, because the "system" is bigger and RAID is no longer the best way to handle it.
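
    To put rough numbers on the "more reliable than hard drives in RAID 1" thought experiment, here is a back-of-the-envelope sketch. It assumes drive failures are independent and ignores rebuild windows and correlated failures, so treat it as an illustration of the comparison rather than a real reliability model. The failure probabilities are invented.

```python
def mirror_loss_probability(p_drive_fails, copies=2):
    """Chance that every copy fails in the same window, assuming independent failures."""
    return p_drive_fails ** copies


def single_device_is_better(p_new_device_fails, p_drive_fails, copies=2):
    """True if one non-redundant device loses data less often than the mirror."""
    return p_new_device_fails < mirror_loss_probability(p_drive_fails, copies)


p_drive = 0.03    # hypothetical chance one hard drive fails in the window
p_new = 0.0005    # hypothetical chance the imagined new storage device fails

print(f"RAID 1 pair loss probability: {mirror_loss_probability(p_drive):.4%}")
print(f"single new device better: {single_device_is_better(p_new, p_drive)}")
```

    The comparison is on the outcome (chance of losing the data), not on how many drives are involved. If the single device comes out ahead, the mirror has nothing left to offer.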



  • @scottalanmiller That is where keeping your IT skillset and knowledge up to date comes into play. At my last job, I got sent to one training seminar over the course of ten years... My current IT knowledge was all built around things that I did at home, in my own time, when I had any to speak of.

    Just because the old way of doing something has a newer counterpart doesn't mean the old way is necessarily shoved to the wayside right away.



  • @dafyre said:

    Just because the old way of doing something has a newer counterpart doesn't mean the old way is necessarily shoved to the wayside right away.

    Actually I think Scott said exactly the opposite. Using the hard drive example: today you do RAID 1 as a minimum, but if a new tech came out tomorrow that didn't require two drives (and assuming cost was the same or less) you should be moving to that new tech for new projects.



  • Both are true 🙂 @dafyre is correct that new doesn't always mean better. And @Dashrender is right that better supersedes "always used it" in value. But what I was saying is simply that we change based on the results and don't really care about new, traditional or any other under the hood artifact.



  • @scottalanmiller said:

    Both are true 🙂 @dafyre is correct that new doesn't always mean better. And @Dashrender is right that better supersedes "always used it" in value. But what I was saying is simply that we change based on the results and don't really care about new, traditional or any other under the hood artifact.

    I generally don't want new methods when it comes to building a server... I want what is tried and true. If somebody comes up with...say... a new FS that just magically saves all your data on all the computers at your employer (Oh wait... somebody look up Aetherstore!)... I don't want to jump to that right away... We'll let somebody else do the testing on it... and after it has proven itself, then at our next server rollout, we can talk about it.



  • Just realized that this topic actually was missing the tags! Ugh, no wonder it rarely comes up in searches. Fixed, finally.