AWS Catastrophic Data Loss

1337

Update August 28, 2019 JST:

As we mentioned in our initial summary, this event impacted a small portion of a single Availability Zone (“AZ”) in our Tokyo Region. The impact was to the Amazon EC2 and Amazon EBS resources in that AZ, though some other services (such as RDS, Redshift, ElastiCache, and WorkSpaces) would have seen some impact in that AZ if their underlying EC2 instances were affected. As we have further investigated this event with our customers, we have discovered a few isolated cases where customers' applications running across multiple Availability Zones saw unexpected impact (i.e., some customers using Application Load Balancer in combination with AWS Web Application Firewall or sticky sessions saw a higher than expected percentage of requests return an Internal Server Error). We are sharing additional details on these isolated issues directly with impacted customers.

Summary of the Amazon EC2 and Amazon EBS Service Event in the Tokyo (AP-NORTHEAST-1) Region

We’d like to give you some additional information about the service disruption that occurred in the Tokyo (AP-NORTHEAST-1) Region on August 23, 2019. Beginning at 12:36 PM JST, a small percentage of EC2 servers in a single Availability Zone in the Tokyo (AP-NORTHEAST-1) Region shut down due to overheating. This resulted in impaired EC2 instances and degraded EBS volume performance for some resources in the affected area of the Availability Zone. The overheating was due to a control system failure that caused multiple, redundant cooling systems to fail in parts of the affected Availability Zone. The affected cooling systems were restored at 3:21 PM JST and temperatures in the affected areas began to return to normal. As temperatures returned to normal, power was restored to the affected instances. By 6:30 PM JST, the vast majority of affected instances and volumes had recovered. A small number of instances and volumes were hosted on hardware which was adversely affected by the loss of power and excessive heat. It took longer to recover these instances and volumes and some needed to be retired as a result of failures to the underlying hardware.

In addition to the impact to affected instances and EBS volumes, there was some impact to the EC2 RunInstances API. At 1:21 PM JST, attempts to launch new EC2 instances targeting the impacted Availability Zone and attempts to use the “idempotency token” (a feature which allows customers to retry run instance commands without risking multiple resulting instance launches) with the RunInstances API in the region began to experience increased error rates. Other EC2 APIs and launches that did not include an “idempotency token” continued to operate normally. This issue also prevented new launches from Auto Scaling, which depends on the “idempotency token”. At 2:51 PM JST, engineers resolved the issue affecting the “idempotency token” and Auto Scaling. Launches of new EC2 instances in the affected Availability Zone continued to fail until 4:05 PM JST, when the EC2 control plane subsystem had been restored in the impacted Availability Zone. Attempts to create new snapshots for affected EBS volumes also experienced increased error rates during the event.
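For readers unfamiliar with the “idempotency token” mentioned above, here is a minimal sketch of the idea in Python. It does not call the real EC2 API: `FakeLauncher`, its `run_instances` method, and `launch_with_retry` are hypothetical stand-ins that only show why reusing one token across retries avoids duplicate launches. (In the real RunInstances API the token is passed as the `ClientToken` parameter.)

```python
import uuid


class FakeLauncher:
    """Hypothetical in-memory stand-in for an instance-launch API."""

    def __init__(self):
        self._by_token = {}  # client token -> instance id already launched
        self.launched = []   # every instance that was actually created

    def run_instances(self, client_token):
        # A repeated token returns the original result instead of launching again.
        if client_token in self._by_token:
            return self._by_token[client_token]
        instance_id = f"i-{uuid.uuid4().hex[:8]}"
        self.launched.append(instance_id)
        self._by_token[client_token] = instance_id
        return instance_id


def launch_with_retry(launcher, attempts=3):
    """Simulate a client that resends one logical launch several times."""
    token = str(uuid.uuid4())  # one token for every attempt of this launch
    result = None
    for _ in range(attempts):
        result = launcher.run_instances(token)
    return result
```

Three attempts with the same token produce a single instance, which is why a caller such as Auto Scaling can retry safely, and also why it could not launch while the token feature was failing.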

This event was caused by a failure of our datacenter control system, which is used to control and optimize the various cooling systems used in our datacenters. The control system runs on multiple hosts for high availability. This control system contains third-party code which allows it to communicate with third-party devices such as fans, chillers, and temperature sensors. It communicates either directly or through embedded Programmable Logic Controllers (PLC) which in turn communicate with the actual devices. Just prior to the event, the datacenter control system was in the process of failing away from one of the control hosts. During this kind of failover, the control system has to exchange information with other control systems and the datacenter equipment it controls (e.g., the cooling equipment and temperature sensors throughout the datacenter) to ensure that the new control host has the most up-to-date information about the state of the datacenter. Due to a bug in the third-party control system logic, this exchange resulted in excessive interactions between the control system and the devices in the datacenter which ultimately resulted in the control system becoming unresponsive. Our datacenters are designed such that if the datacenter control system fails, the cooling systems go into maximum cooling mode until the control system functionality is restored. While this worked correctly in most of the datacenter, in a small portion of the datacenter, the cooling system did not correctly transition to this safe cooling configuration and instead shut down. As an added safeguard, our datacenter operators have the ability to bypass the datacenter control systems and put our cooling system in “purge” mode to quickly exhaust hot air in the event of a malfunction. The team attempted to activate purge in the affected areas of the datacenter, but this also failed. 
At this point, temperatures began to rise in the affected part of the datacenter and servers began to power off when they became too hot. Because the datacenter control system was unavailable, the operations team had minimal visibility into the health and state of the datacenter cooling systems. To recover, the team had to manually investigate and reset all of the affected pieces of equipment and put them into a maximum cooling configuration. During this process, it was discovered that the PLCs controlling some of the air handling units were also unresponsive. These controllers needed to be reset. It was the failure of these PLC controllers which prevented the default cooling and “purge” mode from correctly working. After these controllers were reset, cooling was restored to the affected area of the datacenter and temperatures began to decrease.
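The fail-safe behavior described above (cooling goes to maximum when the control system stops responding) can be sketched as a simple watchdog. This is only an illustration under stated assumptions, not AWS's design: the `CoolingUnit` class, the 30-second timeout, and the 0.0 to 1.0 cooling level are all hypothetical. Note that a fallback like this still depends on the unit's own controller being alive, which is exactly what failed when the PLCs became unresponsive.

```python
class CoolingUnit:
    """Hypothetical air handling unit with a local fail-safe watchdog."""

    def __init__(self, heartbeat_timeout_s=30.0):
        self.heartbeat_timeout_s = heartbeat_timeout_s
        self.last_heartbeat_s = 0.0
        self.commanded_level = 0.5  # fraction of full cooling set by the controller

    def on_controller_command(self, now_s, level):
        # Every valid command doubles as a heartbeat from the control system.
        self.last_heartbeat_s = now_s
        self.commanded_level = max(0.0, min(1.0, level))

    def output_level(self, now_s):
        # If the controller has gone quiet, ignore the stale command and
        # fall back to maximum cooling rather than holding or shutting down.
        if now_s - self.last_heartbeat_s > self.heartbeat_timeout_s:
            return 1.0
        return self.commanded_level
```

The time is passed in explicitly so the behavior can be checked without waiting: a command at t=100 s is honored at t=110 s, and by t=131 s the unit has switched itself to full cooling.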

We are still working with our third-party vendors to understand the bug, and subsequent interactions, that caused both the control system and the impacted PLCs to become unresponsive. In the interim, we have disabled the failover mode that triggered this bug on our control systems to ensure we do not have a recurrence of this issue. We have also trained our local operations teams to quickly identify and remediate this situation if it were to recur, and we are confident that we could reset the system before seeing any customer impact if a similar situation were to occur for any reason. Finally, we are working to modify the way that we control the impacted air handling units to ensure that “purge mode” is able to bypass the PLC controllers completely. This is an approach we have begun using in our newest datacenter designs and will make us even more confident that “purge mode” will work even if PLCs become unresponsive.

During this event, EC2 instances and EBS volumes in other Availability Zones in the region were not affected. Customers that were running their applications thoroughly across multiple Availability Zones were able to maintain availability throughout the event. For customers that need the highest availability for their applications, we continue to recommend running applications with this multiple Availability Zone architecture; any application component that can create availability issues for customers should run in this fault tolerant way.
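As a rough way to reason about the multi-AZ recommendation above, the sketch below checks whether a fleet still has enough instances after losing any one Availability Zone. The function name `survives_single_az_loss` and the idea of a fixed `required_instances` threshold are assumptions made for illustration; real capacity planning also has to account for load balancer behavior, state, and dependencies.

```python
from collections import Counter


def survives_single_az_loss(instance_azs, required_instances):
    """Return True if losing any single AZ still leaves enough instances.

    instance_azs: one AZ name per running instance (example names only).
    required_instances: minimum instances needed to serve traffic.
    """
    counts = Counter(instance_azs)
    if not counts:
        return required_instances <= 0
    total = sum(counts.values())
    # Check the worst case for every AZ in turn.
    return all(total - lost >= required_instances for lost in counts.values())
```

For example, six instances spread evenly over three AZs tolerate one AZ failing when four are required, while three instances in one AZ plus one in another do not tolerate it when two are required.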

We apologize for any inconvenience this event may have caused. We know how critical our services are to our customers’ businesses. We are never satisfied with operational performance that is anything less than perfect, and we will do everything we can to learn from this event and drive improvement across our services.

dafyre

@Pete-S said in AWS Catastrophic Data Loss:

Message From Amazon AWS :

Update August 28, 2019 JST:

That is how a post-mortem write up should look. It's got details, and they know within reasonable doubt what actually happened...

It reads like Lemony Snicket's Series of Unfortunate Events, though, lol.

1337

@dafyre said in AWS Catastrophic Data Loss:

That is how a post-mortem write up should look. It's got details, and they know within reasonable doubt what actually happened...

It reads like Lemony Snicket's Series of Unfortunate Events, though, lol.

Yes, it does. I'm familiar with systems of that type and the problems Amazon experienced were "rookie" mistakes. Imagine a chemical plant or even worse, a nuclear plant that made design mistakes like that. Outcome would have been a little worse than some ones and zeros getting lost.

dafyre

@Pete-S said in AWS Catastrophic Data Loss:

Yes, it does. I'm familiar with systems of that type and the problems Amazon experienced were "rookie" mistakes. Imagine a chemical plant or even worse, a nuclear plant that made design mistakes like that. Outcome would have been a little worse than some ones and zeros getting lost.

It sure seems like rookie mistakes, doesn't it? Their system is complex, and they faced problems along every step of the way. It seems like the biggest mistake here is not actually testing procedures.

Dashrender

@dafyre said in AWS Catastrophic Data Loss:

It sure seems like rookie mistakes, doesn't it? Their system is complex, and they faced problems along every step of the way. It seems like the biggest mistake here is not actually testing procedures.

Eh? It seemed like they did test... they just never had a failure like this in the past. Not saying there isn't room for improvement.

BRRABill

@Dashrender said in AWS Catastrophic Data Loss:

as I understand it - MS has backups, but they are only used by MS when they have a major issue and they need to restore that - like a lost DC or something.

Yes, I believe this kind of issue is the exact thing he would say O365 backups would "cover". Of course if your user deleted something or you got hacked, no backups there for that kind of thing.

Be interesting to hear his take (@scottalanmiller) at some point. Because it is ultimately what I always argued about. I don't care WHO it is ... I don't trust them. Especially if the data is critical/important.

Dashrender

@BRRABill said in AWS Catastrophic Data Loss:

Yes, I believe this kind of issue is the exact thing he would say O365 backups would "cover". Of course if your user deleted something or you got hacked, no backups there for that kind of thing.

Be interesting to hear his take (@scottalanmiller) at some point. Because it is ultimately what I always argued about. I don't care WHO it is ... I don't trust them. Especially if the data is critical/important.

It's not MS deleting/losing data that most people are worried about - because that happens SO rarely... it's cryptoware, users, etc. deleting data that is a much bigger threat.

So frankly how a cloud backup solution isn't just part of the deal is something I don't understand - of course others (not Scott) in this thread are saying - OF COURSE it is...

BRRABill

@Dashrender said in AWS Catastrophic Data Loss:

It's not MS deleting/losing data that most people are worried about - because that happens SO rarely... it's cryptoware, users, etc. deleting data that is a much bigger threat.

So frankly how a cloud backup solution isn't just part of the deal is something I don't understand - of course others (not Scott) in this thread are saying - OF COURSE it is...

No they aren't.

They are saying that if the service (aka O365 or AWS) has an issue with THEIR system (which happens to hold your data) then THEIR system is backed up.

But it sounds like in this situation even that failed.

PhlipElder

@dafyre said in AWS Catastrophic Data Loss:

That is how a post-mortem write up should look. It's got details, and they know within reasonable doubt what actually happened...

It reads like Lemony Snicket's Series of Unfortunate Events, though, lol.

It's amazing. A data centre touted as highly available, cloud only according to some marketing folks, has so many different single points of failure that can bring things down.

I can't count the number of times HVAC "redundant" systems have been the source, or blamed, for system wide outages or outright hardware failures.

Oh, and ATS (Automatic Transfer Switch) systems blowing out A/B/C even though the systems are supposed to be redundant.

A/B/C failure from one power provider causing a cascade failure.

Generator failures as mentioned here in the first article.

Storms.

The moral of this story is: Back Up. Back Up. Back the eff up.

PhlipElder

@PhlipElder said in AWS Catastrophic Data Loss:

It's amazing. A data centre touted as highly available, cloud only according to some marketing folks, has so many different single points of failure that can bring things down.

I can't count the number of times HVAC "redundant" systems have been the source, or blamed, for system wide outages or outright hardware failures.

Oh, and ATS (Automatic Transfer Switch) systems blowing out A/B/C even though the systems are supposed to be redundant.

A/B/C failure from one power provider causing a cascade failure.

Generator failures as mentioned here in the first article.

Storms.

The moral of this story is: Back Up. Back Up. Back the eff up.

Oh, and one more thing: Thinking a distributed system, whether storage or region or whatever, is a "backup" is like saying RAID is a backup. It is not. Period.
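A toy illustration of that point, in Python: replication faithfully copies a deletion to every replica, while a separate point-in-time copy still holds the data. The `ReplicatedStore` class and `take_backup` helper are hypothetical and exist only to show the difference; they do not model any real storage product.

```python
import copy


class ReplicatedStore:
    """Hypothetical key-value store that mirrors every write to all replicas."""

    def __init__(self, replicas=3):
        self.replicas = [dict() for _ in range(replicas)]

    def put(self, key, value):
        for replica in self.replicas:
            replica[key] = value

    def delete(self, key):
        # Deletions (accidental or malicious) are replicated just like writes.
        for replica in self.replicas:
            replica.pop(key, None)

    def get(self, key):
        return self.replicas[0].get(key)


def take_backup(store):
    """Return an independent point-in-time copy of the store's contents."""
    return copy.deepcopy(store.replicas[0])
```

After a `put`, a `take_backup`, and then a `delete`, none of the three replicas has the key any more, but the backup copy does. The same logic applies to RAID mirrors and to data spread across zones or regions.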

dafyre

@Dashrender said in AWS Catastrophic Data Loss:

Eh? It seemed like they did test... they just never had a failure like this in the past. Not saying there isn't room for improvement.

The blurb that @Pete-S shared doesn't say much about their testing procedures, so you could well be right. I've definitely seen cascading failures like this before, where it just seems like all of the 'safeties' that you had in place would fail until things finally shut down.

Dashrender

@BRRABill said in AWS Catastrophic Data Loss:

No they aren't.

They are saying that if the service (aka O365 or AWS) has an issue with THEIR system (which happens to hold your data) then THEIR system is backed up.

But it sounds like in this situation even that failed.

Too many bloody "THEIR"s in there...

I think you are saying that MS has its own backups for cases where MS's DC blows the hell up. Then they can restore that data - to the last backup point.

What I'm saying is that that is USELESS to end users/corporate customers... because the chances that MS's DC is going to blow up is extremely small compared to other types of data loss, like Cryptoware, users, etc. In those cases, MS will say - tough tits.. we don't provide backups for your data, that is on you.

PhlipElder

@dafyre said in AWS Catastrophic Data Loss:

The blurb that @Pete-S shared doesn't say much about their testing procedures, so you could well be right. I've definitely seen cascading failures like this before, where it just seems like all of the 'safeties' that you had in place would fail until things finally shut down.

That's the inherent problem with hyper-scale systems. There is no way to fully test resilience. None. Nada. Zippo. Zilch.

It's all fly by the seat of the pants theory until the sh#t statement above happens.

1337

@PhlipElder said in AWS Catastrophic Data Loss:

That's the inherent problem with hyper-scale systems. There is no way to fully test resilience. None. Nada. Zippo. Zilch.

It's all fly by the seat of the pants theory until the sh#t statement above happens.

I would argue and say that making something fail-safe is not the problem. It's most likely that they didn't think it was important enough to invest enough time and money in their hyperscale datacenter to make sure it wouldn't fail on nonsense like this. After all the goal is to make money, not spend more than needed.

The technology and knowledge exist because they're used in other industries where failures will result in death and catastrophe.

dafyre

@Pete-S said in AWS Catastrophic Data Loss:

I would argue and say that making something fail-safe is not the problem. It's most likely that they didn't think it was important enough to invest enough time and money in their hyperscale datacenter to make sure it wouldn't fail on nonsense like this. After all the goal is to make money, not spend more than needed.

The technology and knowledge exist because they're used in other industries where failures will result in death and catastrophe.

I agree with you here somewhat. There is no such thing as "won't fail." There's always a chance of failure. The more money you can throw at it, the less likely a cascade of failures is to happen.

Having worked in IT a long time, I can count on one hand the number of times I've seen PLCs fail in similar situations. But they are electronics and they're not going to be 100% reliable when there are voltage spikes and power brownouts.

Their goal is definitely to make money, but they also have to spend enough to protect their reputation when stuff like this does happen (and it will).

dafyre

@PhlipElder said in AWS Catastrophic Data Loss:

That's the inherent problem with hyper-scale systems. There is no way to fully test resilience. None. Nada. Zippo. Zilch.

It's all fly by the seat of the pants theory until the sh#t statement above happens.

There is a way to fully test resiliency at hyper-scale. But to @Pete-S's comment about money, there's a cost to doing that kind of planning and testing and improving, and retesting. A company whose only goal is to make as much money as humanly possible is probably not going to put enough money into a FULL resiliency test, but it can be done...

BRRABill

@Dashrender said in AWS Catastrophic Data Loss:

What I'm saying is that that is USELESS to end users/corporate customers...

I've been arguing that for years.

because the chances that MS's DC is going to blow up is extremely small

And yet, it is what this thread is about ... exactly that happening.

Dashrender

@BRRABill said in AWS Catastrophic Data Loss:

because the chances that MS's DC is going to blow up is extremely small

And yet, it is what this thread is about ... exactly that happening.

Except that it's Amazon, not MS.

dafyre

@Dashrender said in AWS Catastrophic Data Loss:

Except that it's Amazon, not MS.

Same difference, though. Think about the number of issues that some folks have with Microsoft.

DustinB3403

@dafyre said in AWS Catastrophic Data Loss:

Same difference, though. Think about the number of issues that some folks have with Microsoft.

Paging @scottalanmiller !!!