Excellent Discussion on Risk Management



  • http://www.theregister.co.uk/2015/12/11/electrician_cuts_the_wrong_wire_and_brings_down_25000_square_feet_of_data_centre/

    Read the article and the comments: together they make a very good discussion of how to quantify risk and how to understand it as it applies to your business.



  • Stolen from the comments section

    Naselus writes:
    "would a "99/100 chance of success" put anyone off?"
    Anyone who's thinking in terms of datacenter risk scaling, yes. If you have a contractual obligation to nine nines of uptime, then a 99/100 chance of success is horrifyingly risky. Think about it by converting each figure into the amount of downtime you are allowed per year.
    99% (99/100) means about three and a half days of downtime in a year.
    99.9% means nearly 9 hours of downtime in a year.
    99.99% means about 50 minutes of downtime in a year.
    99.999% means about 5 minutes of downtime in a year - this is the minimum level any serious hosting data centre would ever claim.
    By the time you get to nine nines, you have about 30 milliseconds - as in, your customer won't notice the downtime in the middle of a ping test.
    So, when the IT guy says 'there's only a 99% chance of success', what he's saying is 'this is ten million times more risky than our uptime SLA allows for, do not do this under any circumstances'. You can then schedule downtime which is excluded from your SLA uptime target.
    Beancounters really ought to understand this, since shoveling risk around is part of their job.
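
    To check the arithmetic in that comment, here is a minimal Python sketch. It assumes a 365-day year and treats "N nines" as an allowed downtime fraction of 10^-N; the function names are my own, not anything from the article or the comment.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000; leap years ignored (assumption)


def allowed_downtime_seconds(nines: int) -> float:
    """Yearly downtime budget for an availability of `nines` nines (2 means 99%)."""
    return SECONDS_PER_YEAR * 10.0 ** -nines


def humanize(seconds: float) -> str:
    """Render a duration in the largest unit that keeps the value at 1 or above."""
    for unit, size in (("days", 86400), ("hours", 3600), ("minutes", 60), ("seconds", 1)):
        if seconds >= size:
            return f"{seconds / size:.2f} {unit}"
    return f"{seconds * 1000:.1f} milliseconds"


if __name__ == "__main__":
    for nines in (2, 3, 4, 5, 9):
        budget = allowed_downtime_seconds(nines)
        print(f"{nines} nines: {humanize(budget)} of downtime per year")

    # How much more downtime a 99% figure implies than a nine-nines SLA allows.
    ratio = allowed_downtime_seconds(2) / allowed_downtime_seconds(9)
    print(f"99% is about {ratio:,.0f} times the nine-nines budget")
```

    Under those assumptions the budgets work out to roughly 3.65 days, 8.76 hours, 52.6 minutes, 5.26 minutes and 31.5 milliseconds, and the ratio between 99% and nine nines is ten million, which matches the comment.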



  • I say this all of the time. RAID 5, even with small, high-end SAS drives, rarely offers better than a 99% chance of not losing your data. A 1% chance of data loss is a far higher risk than business-side people are generally led to believe their RAID protects them against.