Amazon S3 Outage shows the danger of doing things cheaply.


  • Service Provider

    https://medium.com/dara-it/amazon-s3-outage-shows-the-danger-of-doing-things-cheaply-ce335e5b7edf#.bormdw6jd

    The TL;DR: Amazon, like any technology tool, can be run for extreme reliability or used with an element of risk. How much time and money you spend on the tool makes all the difference.



  • I would like to see the cost comparison. How much money did these companies lose due to this outage vs how much money they would use having a resilient infrastructure that is geographically load balanced. Is a few hours of downtime a year due to S3 going to break a lot of these companies?



  • @coliver said in Amazon S3 Outage shows the danger of doing things cheaply.:

    I would like to see the cost comparison. How much money did these companies lose due to this outage vs how much money they would use having a resilient infrastructure that is geographically load balanced. Is a few hours of downtime a year due to S3 going to break a lot of these companies?

    I'm thinking it would rarely be worth the expense - doubling the requirements needed, basically doubling their costs.

    I suppose if they could get away with a fraction of the size in a disparate location, it might be worthwhile. But then the concern is, will the reduced size be able to handle the load, and if not, will it fail completely anyways because of the demand?



  • @Dashrender said in Amazon S3 Outage shows the danger of doing things cheaply.:

    @coliver said in Amazon S3 Outage shows the danger of doing things cheaply.:

    I would like to see the cost comparison. How much money did these companies lose due to this outage vs how much money they would use having a resilient infrastructure that is geographically load balanced. Is a few hours of downtime a year due to S3 going to break a lot of these companies?

    I'm thinking it would rarely be worth the expense - doubling the requirements needed, basically doubling their costs.

    I suppose if they could get away with a fraction of the size in a disparate location, it might be worthwhile. But then the concern is, will the reduced size be able to handle the load, and if not, will it fail completely anyways because of the demand?

    Potentially. AWS and S3 do offer some interesting load balancing that would be cheaper than double the price.... still not sure it would be worth the expense though.



  • I disagree with the article. One of the main reasons I would move to a cloud service is to outsource my redundancy and resilience. I move from on-premise to Amazon precisely because they have the economies of scale and expertise to manage the infrastructure better than I can. So of course I can blame them if they fail, unless it's within their SLA (which I don't think they have?). If I have to start bringing the management of resilience and redundancy back in-house, then part of the point of cloud services disappears. It has nothing to do with cost.



  • @Carnival-Boy said in Amazon S3 Outage shows the danger of doing things cheaply.:

    I disagree with the article. One of the main reasons I would move to a cloud service is to outsource my redundancy and resilience. I move from on-premise to Amazon precisely because they have the economies of scale and expertise to manage the infrastructure better than I can. So of course I can blame them if they fail, unless it's within their SLA (which I don't think they have?). If I have to start bringing the management of resilience and redundancy back in-house, then part of the point of cloud services disappears. It has nothing to do with cost.

    Well you can definitely outsource all of those things, but just because you move things to AWS or any cloud service doesn't make you instantly failure proof. If you need a certain level of uptime, whoever is managing this for you, be it you or someone you hire, has to know your expectations and purchase accordingly.

    Are you saying that you assume that simply by putting a VM (or actual cloud service) in AWS that you automatically assume you have full DC failover, etc? Why do you assume this?


  • Service Provider

    @Carnival-Boy said in Amazon S3 Outage shows the danger of doing things cheaply.:

    I disagree with the article. One of the main reasons I would move to a cloud service is to outsource my redundancy and resilience.

    But you don't buy any of that from Amazon. This is the biggest misconception about cloud computing.

    What you are buying is access to resources on their platform. You can buy resources in Asia, the US, or Europe, but it is completely up to you to design and manage these resources so they work according to your needs and are fit for purpose.

    Amazon are indeed the best provider, but traditional wisdom still applies: good system design, failure/recovery planning, and weighing the cost of the potential risk against the spend to prevent it.


  • Service Provider

    @coliver said

    Potentially. AWS and S3 do offer some interesting load balancing that would be cheaper than double the price.... still not sure it would be worth the expense though.

    AWS costs are a bit of a minefield, so for simplicity of math let's assume equal spend if you want copies.

    S3 does indeed have geo distribution for the storage but that costs more money.

    Also...
    http://searchaws.techtarget.com/news/2240223024/Code-Spaces-goes-dark-after-AWS-cloud-security-hack

    If all of your eggs are inside a single management portal, what happens if that 1 management portal gets breached?



  • @Breffni-Potter said in Amazon S3 Outage shows the danger of doing things cheaply.:

    @coliver said

    Potentially. AWS and S3 do offer some interesting load balancing that would be cheaper than double the price.... still not sure it would be worth the expense though.

    AWS costs are a bit of a minefield, so for simplicity of math let's assume equal spend if you want copies.

    S3 does indeed have geo distribution for the storage but that costs more money.

    Also...
    http://searchaws.techtarget.com/news/2240223024/Code-Spaces-goes-dark-after-AWS-cloud-security-hack

    If all of your eggs are inside a single management portal, what happens if that 1 management portal gets breached?

    But that still doesn't really answer the question. What is the cost of building out that resiliency, and does it actually benefit the company? My guess is that, in a lot of cases, building resiliency costs more than the downtime it prevents. The Code Spaces issue was entirely on them. Not having a decent backup system isn't the fault of the cloud, AWS, or the management engine.


  • Service Provider

    @coliver It depends, in this article I just looked at S3 storage alone and without any bandwidth charges.

    What is the app, how much storage, how much processing power and memory, how much bandwidth, how many users, which geographic locations they need to serve. Complexity of the app or service to install/update/deploy.

    It's like any IT project: what is the risk of this happening, what is the cost of prevention, and then making an informed decision. The same thought process and decision making that goes into a single server setup or dual server setup on prem still applies when we look into cloud computing.



  • @Breffni-Potter said in Amazon S3 Outage shows the danger of doing things cheaply.:

    @coliver It depends, in this article I just looked at S3 storage alone and without any bandwidth charges.

    What is the app, how much storage, how much processing power and memory, how much bandwidth, how many users, which geographic locations they need to serve. Complexity of the app or service

    It's like any IT project: what is the risk of this happening, what is the cost of prevention, and then making an informed decision. The same thought process and decision making that goes into a single server setup or dual server setup on prem still applies when we look into cloud computing.

    Agreed completely. Which is why I bring it up. This may not be an issue of companies being cheap or doing the cheapest thing for the sake of expense. This is likely companies being fiscally responsible and choosing the best option with the greatest return. S3, and AWS, have been ridiculously reliable, most times having only a few hours of unplanned outages a year, and even fewer planned outages. I agree companies can't place the blame on AWS when things go down, but they can say our infrastructure is having issues due to an AWS outage.


  • Service Provider

    @coliver said in Amazon S3 Outage shows the danger of doing things cheaply.:

    I would like to see the cost comparison. How much money did these companies lose due to this outage vs how much money they would use having a resilient infrastructure that is geographically load balanced. Is a few hours of downtime a year due to S3 going to break a lot of these companies?

    That's missing a HUGE element, which was the risk. Just because you have a bad outage doesn't mean that it was worth protecting against. You have to consider how likely it was to have happened, regardless of whether or not it did happen.
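
    The weighing of likelihood against cost described here can be sketched as an expected-loss calculation. This is only a minimal illustration, not a costing method taken from the thread: the function names and every figure in the usage lines are made-up assumptions, chosen to show the shape of the comparison.

```python
def expected_annual_loss(outages_per_year: float,
                         hours_per_outage: float,
                         loss_per_hour: float) -> float:
    """Average yearly cost of an outage type: likelihood x duration x hourly loss."""
    return outages_per_year * hours_per_outage * loss_per_hour


def protection_pays_off(expected_loss: float,
                        annual_protection_cost: float) -> bool:
    """True only if the protection costs less per year than the loss it avoids.

    Assumes (optimistically) that the protection removes the loss entirely.
    """
    return annual_protection_cost < expected_loss


# Hypothetical numbers: one 4-hour outage every two years, $10,000 lost per hour.
loss = expected_annual_loss(outages_per_year=0.5,
                            hours_per_outage=4,
                            loss_per_hour=10_000)
print(f"Expected loss per year: ${loss:,.0f}")
print("Second region at $60,000/year pays off:", protection_pays_off(loss, 60_000))
print("Cheaper standby at $15,000/year pays off:", protection_pays_off(loss, 15_000))
```

    With these invented figures a full second region does not pay for itself while a cheaper standby would; real inputs could easily flip either result.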


  • Service Provider

    @Dashrender said in Amazon S3 Outage shows the danger of doing things cheaply.:

    @coliver said in Amazon S3 Outage shows the danger of doing things cheaply.:

    I would like to see the cost comparison. How much money did these companies lose due to this outage vs how much money they would use having a resilient infrastructure that is geographically load balanced. Is a few hours of downtime a year due to S3 going to break a lot of these companies?

    I'm thinking it would rarely be worth the expense - doubling the requirements needed, basically doubling their costs.

    I suppose if they could get away with a fraction of the size in a disparate location, it might be worthwhile. But then the concern is, will the reduced size be able to handle the load, and if not, will it fail completely anyways because of the demand?

    Yup. Big companies normally consider the risks versus the costs and determine if protecting against something makes sense.



  • @scottalanmiller said in Amazon S3 Outage shows the danger of doing things cheaply.:

    @coliver said in Amazon S3 Outage shows the danger of doing things cheaply.:

    I would like to see the cost comparison. How much money did these companies lose due to this outage vs how much money they would use having a resilient infrastructure that is geographically load balanced. Is a few hours of downtime a year due to S3 going to break a lot of these companies?

    That's missing a HUGE element, which was the risk. Just because you have a bad outage doesn't mean that it was worth protecting against. You have to consider how likely it was to have happened, regardless of whether or not it did happen.

    I get that, I would like to see the numbers related to the most recent outage though, for purely academic reasons. I just think it would be interesting. Ignoring the risk entirely for a moment, my guess is that having the infrastructure for a year to protect against this unlikely event would have still cost more than the downtime itself cost.


  • Service Provider

    @coliver said in Amazon S3 Outage shows the danger of doing things cheaply.:

    @scottalanmiller said in Amazon S3 Outage shows the danger of doing things cheaply.:

    @coliver said in Amazon S3 Outage shows the danger of doing things cheaply.:

    I would like to see the cost comparison. How much money did these companies lose due to this outage vs how much money they would use having a resilient infrastructure that is geographically load balanced. Is a few hours of downtime a year due to S3 going to break a lot of these companies?

    That's missing a HUGE element, which was the risk. Just because you have a bad outage doesn't mean that it was worth protecting against. You have to consider how likely it was to have happened, regardless of whether or not it did happen.

    I get that, I would like to see the numbers related to the most recent outage though, for purely academic reasons. I just think it would be interesting. Ignoring the risk entirely for a moment, my guess is that having the infrastructure for a year to protect against this unlikely event would have still cost more than the downtime itself cost.

    Very easily for sure.



  • @Dashrender said in Amazon S3 Outage shows the danger of doing things cheaply.:

    Are you saying that you assume that simply by putting a VM (or actual cloud service) in AWS that you automatically assume you have full DC failover, etc? Why do you assume this?

    I don't know what you mean by "full DC failover". I would assume I'd have uptime within the SLA or within published expectations of uptime, which in Amazon's case is about 100% I believe?


  • Service Provider

    @Carnival-Boy said in Amazon S3 Outage shows the danger of doing things cheaply.:

    @Dashrender said in Amazon S3 Outage shows the danger of doing things cheaply.:

    Are you saying that you assume that simply by putting a VM (or actual cloud service) in AWS that you automatically assume you have full DC failover, etc? Why do you assume this?

    I don't know what you mean by "full DC failover". I would assume I'd have uptime within the SLA or within published expectations of uptime, which in Amazon's case is about 100% I believe?

    A bit below 100%. Their uptime is from using multiple data centers.
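
    As a rough illustration of why spreading across data centers raises uptime: if each site is available a fraction a of the time and the sites fail independently, at least one of n sites is up with probability 1 - (1 - a)^n. The independence assumption is an illustrative simplification, not Amazon's published model, and the function name and sample figures below are hypothetical.

```python
def combined_availability(per_site: float, sites: int) -> float:
    """Probability that at least one of `sites` independent sites is up.

    `per_site` is a fraction between 0 and 1 (0.99 means 99%).
    Assumes failures are independent, which correlated events such as an
    operator error affecting several sites at once would violate.
    """
    if not 0.0 <= per_site <= 1.0:
        raise ValueError("per_site must be between 0 and 1")
    if sites < 1:
        raise ValueError("sites must be at least 1")
    return 1.0 - (1.0 - per_site) ** sites


# Hypothetical example: sites that are each up 99% of the time.
for n in (1, 2, 3):
    print(f"{n} site(s): {combined_availability(0.99, n):.6%}")
```

    The formula only helps when the application is actually built to use the other sites; a workload pinned to one data center gets the single-site figure.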



  • Finally a followup article. Apparently they just rebooted too many machines at the same time. Human error, is anyone surprised?


  • Service Provider

    @travisdh1 said in Amazon S3 Outage shows the danger of doing things cheaply.:

    Finally a followup article. Apparently they just rebooted too many machines at the same time. Human error, is anyone surprised?

    Always something simple.



  • @Carnival-Boy said in Amazon S3 Outage shows the danger of doing things cheaply.:

    @Dashrender said in Amazon S3 Outage shows the danger of doing things cheaply.:

    Are you saying that you assume that simply by putting a VM (or actual cloud service) in AWS that you automatically assume you have full DC failover, etc? Why do you assume this?

    I don't know what you mean by "full DC failover". I would assume I'd have uptime within the SLA or within published expectations of uptime, which in Amazon's case is about 100% I believe?

    DC as in Datacenter failover - i.e. this DC is offline for whatever reason, so now your data/services are running in another DC?
    Even if it is listed at 100% the SLA just gives them an out when they don't meet it, i.e. they get to send you a check, that's it. Nothing more. I wouldn't expect them to realtime clone your data to another DC unless you're paying for that feature, at which point your SLA would be even higher, and you're paying a TON more.



  • Right. So why would you think that I would think that if I put data in just one DC I would have DC failover? That doesn't make any sense.



  • I don't blame AWS at all. If you use a service like AWS, use it properly and build for DR or take the risk! Think about the systems you use, and build properly.


  • Service Provider

    @Jimmy9008 said in Amazon S3 Outage shows the danger of doing things cheaply.:

    I don't blame AWS at all. If you use a service like AWS, use it properly and build for DR or take the risk! Think about the systems you use, and build properly.

    Oh exactly. It is what it is. Single DC dependency, and on a single service in that DC. AWS tells us what to do if we need higher reliability than that. They were within SLA, I believe. It's all good.



  • @scottalanmiller

    Exactly. We host everything at HQ on site, with a colo in Essex for DR purposes. If HQ was lost, staff are screwed until we restore from backups BUT... customers (the important part) are not really affected at all. We keep a hot copy of our websites and databases running a day out of date (which the business are fine with) in the Essex colo. We then use 'the cloud' to manage the failover process, which is a cheap solution compared to multiple Cloud DCs hosting everything.

    We have one VM in Azure, and one in AWS. Both check that our websites hosted at HQ are available on HTTP/HTTPS every second or so. If the sites are not responding, the VMs use the Cloudflare API to point DNS for all our websites to the hot running copies in the colo that are a day out of date. Pretty fast. When tested, it takes seconds and we're back online from a customer perspective. Our test: unplug our gateway firewall and see what happens... easy.

    Yeah it can be better, but it meets our needs and other than Cloudflare (which does go down) we have no single point of failure... We're happy with that risk.
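
    The watchdog pattern described in this post can be sketched roughly as follows. This is not the poster's actual code. The `FailoverMonitor` class, the `failures_needed` threshold, and the `switch_dns` callback are all hypothetical; in particular, the real Cloudflare API call is deliberately left out and represented by a function the caller supplies.

```python
import http.client
import urllib.error
import urllib.request
from typing import Callable


def site_is_up(url: str, timeout: float = 2.0) -> bool:
    """Return True if the URL answers with a 2xx/3xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (urllib.error.URLError, OSError, ValueError, http.client.HTTPException):
        # Covers HTTP errors, refused connections, timeouts and malformed URLs.
        return False


class FailoverMonitor:
    """Calls switch_dns() once after several consecutive failed checks."""

    def __init__(self, check: Callable[[], bool],
                 switch_dns: Callable[[], None],
                 failures_needed: int = 3) -> None:
        self.check = check
        self.switch_dns = switch_dns          # hypothetical hook for the DNS update
        self.failures_needed = failures_needed
        self.consecutive_failures = 0
        self.failed_over = False

    def poll(self) -> bool:
        """Run one check. Returns True only on the poll that triggers failover."""
        if self.check():
            self.consecutive_failures = 0
            return False
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failures_needed and not self.failed_over:
            self.switch_dns()
            self.failed_over = True
            return True
        return False
```

    A caller would build the monitor with something like `FailoverMonitor(lambda: site_is_up("https://www.example.com"), switch_dns=my_dns_update)` and call `poll()` in a loop with a short sleep. Requiring a few failures in a row is an added design choice, not something the post states, meant to avoid flipping DNS on a single dropped request.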



  • @Breffni-Potter said:

    @Carnival-Boy said in Amazon S3 Outage shows the danger of doing things cheaply.:

    I disagree with the article. One of the main reasons I would move to a cloud service is to outsource my redundancy and resilience.

    But you don't buy any of that from Amazon. This is the biggest misconception about cloud computing.

    Clearly I have a misconception. I'm not an Amazon customer, but looking at their website, they say things like:
    Designed for 99.999999999% durability and 99.99% availability of objects over a given year.
    Designed to sustain the concurrent loss of data in two facilities.

    Amazon S3 redundantly stores data in multiple facilities and on multiple devices within each facility.

    All of this suggests to me that they are selling resilience. If I read "designed for 99.99%" and then only got 90% availability, would it be fair for Amazon to say "yeah, but that's your fault, we never sold you resilience"? I don't think so.

    If the argument we're having is "you're not paying for 100% availability" then I agree with you. If your argument is "you're not paying for resilience" then I struggle to agree with you.
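
    The two quoted figures measure different things: durability is about not losing stored objects, while availability is about being able to reach them. A minimal sketch of what the durability number implies, assuming it is read as a per-object chance of surviving one year (an interpretation, labeled as such), with hypothetical function names and a hypothetical object count:

```python
from decimal import Decimal


def expected_objects_lost_per_year(object_count: int,
                                   durability_percent: str) -> Decimal:
    """Expected yearly object losses, treating durability as a per-object,
    per-year survival probability (an assumption about the quoted figure)."""
    loss_probability = (Decimal(100) - Decimal(durability_percent)) / Decimal(100)
    return Decimal(object_count) * loss_probability


def years_per_lost_object(object_count: int, durability_percent: str) -> Decimal:
    """Average number of years between single-object losses."""
    return Decimal(1) / expected_objects_lost_per_year(object_count,
                                                       durability_percent)


# Hypothetical bucket holding ten million objects at eleven nines of durability.
print(expected_objects_lost_per_year(10_000_000, "99.999999999"))
print(years_per_lost_object(10_000_000, "99.999999999"))
```

    Decimal arithmetic is used because eleven nines sits too close to 1 for ordinary floating point to subtract cleanly. None of this says anything about the 99.99% availability figure, which is the number an outage counts against.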



  • @Carnival-Boy Agree. Nice.


  • Service Provider

    @Carnival-Boy said in Amazon S3 Outage shows the danger of doing things cheaply.:

    @Breffni-Potter said:

    @Carnival-Boy said in Amazon S3 Outage shows the danger of doing things cheaply.:

    I disagree with the article. One of the main reasons I would move to a cloud service is to outsource my redundancy and resilience.

    But you don't buy any of that from Amazon. This is the biggest misconception about cloud computing.

    Clearly I have a misconception. I'm not an Amazon customer, but looking at their website, they say things like:
    Designed for 99.999999999% durability and 99.99% availability of objects over a given year.
    Designed to sustain the concurrent loss of data in two facilities.

    Amazon S3 redundantly stores data in multiple facilities and on multiple devices within each facility.

    All of this suggests to me that they are selling resilience. If I read "designed for 99.99%" and then only got 90% availability, would it be fair for Amazon to say "yeah, but that's your fault, we never sold you resilience"? I don't think so.

    99.99% is very low availability. 99.999% is "standard" availability. High availability is 99.9999%. They are selling 99.99% uptime, that can't be considered "selling reliability" as it is far too unreliable for that. It's fine for most customers, most customers don't need much availability.

    So I read the same thing as saying "designed for 99.99% availability" which is a direct statement making it super clear that Amazon S3, unless you do things yourself to make it high availability, is not at all designed for "availability" as a target feature. To me, they've clarified that in what you quoted to make sure we don't assume that availability is their specialty.

    And they meet 99.99% with ease. 90% would mean that they were down for more than a month a year (about 36.5 days), not an afternoon.
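
    The percentages being traded here are easier to compare as downtime per year. A small sketch of the arithmetic, assuming a 365-day year; the function name is just for illustration:

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def downtime_hours_per_year(availability_percent: float) -> float:
    """Hours of allowed downtime per year at a given availability percentage."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability_percent must be between 0 and 100")
    return (100.0 - availability_percent) / 100.0 * HOURS_PER_YEAR


for pct in (90.0, 99.99, 99.999, 99.9999):
    hours = downtime_hours_per_year(pct)
    print(f"{pct}% -> {hours:.4f} hours ({hours * 60:.2f} minutes) per year")
```

    On these numbers 99.99% allows a little under an hour of downtime a year, 99.999% about five minutes, 99.9999% about half a minute, and 90% works out to roughly 36.5 days.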


  • Service Provider

    @Carnival-Boy said in Amazon S3 Outage shows the danger of doing things cheaply.:

    If your argument is "you're not paying for resilience" then I struggle to agree with you.

    You are paying for a very specific level of resilience which is considered "low". So you "are paying for resilience", but not high resilience.



  • Yeah, I'd agree with that. You're right that 99.99% is low.


  • Service Provider

    @Carnival-Boy said in Amazon S3 Outage shows the danger of doing things cheaply.:

    Yeah, I'd agree with that. You're right that 99.99% is low.

    Or low-ish at least. It's four nines. It's more than I expect from an average SAN :) Less than I expect from an average server.

