AWS Catastrophic Data Loss

dafyre

@Pete-S said in AWS Catastrophic Data Loss:

Message From Amazon AWS :

Update August 28, 2019 JST:

That is how a post-mortem write up should look. It's got details, and they know within reasonable doubt what actually happened...

It reads like Lemony Snicket's Series of Unfortunate Events, though, lol.

1337

@dafyre said in AWS Catastrophic Data Loss:

@Pete-S said in AWS Catastrophic Data Loss:

Message From Amazon AWS :

Update August 28, 2019 JST:

That is how a post-mortem write up should look. It's got details, and they know within reasonable doubt what actually happened...

It reads like Lemony Snicket's Series of Unfortunate Events, though, lol.

Yes, it does. I'm familiar with systems of that type and the problems Amazon experienced were "rookie" mistakes. Imagine a chemical plant or even worse, a nuclear plant that made design mistakes like that. Outcome would have been a little worse than some ones and zeros getting lost.

dafyre

@Pete-S said in AWS Catastrophic Data Loss:

@dafyre said in AWS Catastrophic Data Loss:

@Pete-S said in AWS Catastrophic Data Loss:

Message From Amazon AWS :

Update August 28, 2019 JST:

That is how a post-mortem write up should look. It's got details, and they know within reasonable doubt what actually happened...

It reads like Lemony Snicket's Series of Unfortunate Events, though, lol.

Yes, it does. I'm familiar with systems of that type and the problems Amazon experienced were "rookie" mistakes. Imagine a chemical plant or even worse, a nuclear plant that made design mistakes like that. Outcome would have been a little worse than some ones and zeros getting lost.

It sure seems like rookie mistakes, doesn't it. Their system is complex, and they faced problems along every step of the way. It seems like the biggest mistake here is not actually testing procedures.

Dashrender

@dafyre said in AWS Catastrophic Data Loss:

@Pete-S said in AWS Catastrophic Data Loss:

@dafyre said in AWS Catastrophic Data Loss:

@Pete-S said in AWS Catastrophic Data Loss:

Message From Amazon AWS :

Update August 28, 2019 JST:

That is how a post-mortem write up should look. It's got details, and they know within reasonable doubt what actually happened...

It reads like Lemony Snicket's Series of Unfortunate Events, though, lol.

Yes, it does. I'm familiar with systems of that type and the problems Amazon experienced were "rookie" mistakes. Imagine a chemical plant or even worse, a nuclear plant that made design mistakes like that. Outcome would have been a little worse than some ones and zeros getting lost.

It sure seems like rookie mistakes, doesn't it. Their system is complex, and they faced problems along every step of the way. It seems like the biggest mistake here is not actually testing procedures.

Eh? It seemed like they did test; they just never had a failure like this in the past. Not saying there isn't room for improvement.

BRRABill

@Dashrender said in AWS Catastrophic Data Loss:

as I understand it - MS has backups, but they are only used by MS when they have a major issue and they need to restore that - like a lost DC or something.

Yes, I believe this kind of issue is the exact thing he would say O365 backups would "cover". Of course if your user deleted something or you got hacked, no backups there for that kind of thing.

Be interesting to hear his take (@scottalanmiller) at some point. Because it is ultimately what I always argued about. I don't care WHO it is ... I don't trust them. Especially if the data is critical/important.

Dashrender

@BRRABill said in AWS Catastrophic Data Loss:

@Dashrender said in AWS Catastrophic Data Loss:

as I understand it - MS has backups, but they are only used by MS when they have a major issue and they need to restore that - like a lost DC or something.

Yes, I believe this kind of issue is the exact thing he would say O365 backups would "cover". Of course if your user deleted something or you got hacked, no backups there for that kind of thing.

Be interesting to hear his take (@scottalanmiller) at some point. Because it is ultimately what I always argued about. I don't care WHO it is ... I don't trust them. Especially if the data is critical/important.

It's not MS deleting/losing data that most people are worried about, because that happens SO rarely. It's cryptoware, users, etc. deleting data that is a much bigger threat.

So frankly how a cloud backup solution isn't just part of the deal is something I don't understand - of course others (not Scott) in this thread are saying - OF COURSE it is...

BRRABill

@Dashrender said in AWS Catastrophic Data Loss:

@BRRABill said in AWS Catastrophic Data Loss:

@Dashrender said in AWS Catastrophic Data Loss:

as I understand it - MS has backups, but they are only used by MS when they have a major issue and they need to restore that - like a lost DC or something.

Yes, I believe this kind of issue is the exact thing he would say O365 backups would "cover". Of course if your user deleted something or you got hacked, no backups there for that kind of thing.

Be interesting to hear his take (@scottalanmiller) at some point. Because it is ultimately what I always argued about. I don't care WHO it is ... I don't trust them. Especially if the data is critical/important.

It's not MS deleting/losing data that most people are worried about, because that happens SO rarely. It's cryptoware, users, etc. deleting data that is a much bigger threat.

So frankly how a cloud backup solution isn't just part of the deal is something I don't understand - of course others (not Scott) in this thread are saying - OF COURSE it is...

No they aren't.

They are saying that if the service (aka O365 or AWS) has an issue with THEIR system (which happens to hold your data) then THEIR system is backed up.

But it sounds like in this situation even that failed.

PhlipElder

@dafyre said in AWS Catastrophic Data Loss:

@Pete-S said in AWS Catastrophic Data Loss:

Message From Amazon AWS :

Update August 28, 2019 JST:

That is how a post-mortem write up should look. It's got details, and they know within reasonable doubt what actually happened...

It reads like Lemony Snicket's Series of Unfortunate Events, though, lol.

It's amazing. A data centre touted as highly available, cloud only according to some marketing folks, has so many different single points of failure that can bring things down.

I can't count the number of times HVAC "redundant" systems have been the source, or blamed, for system wide outages or outright hardware failures.

Oh, and ATS (Automatic Transfer Switch) systems blowing out A/B/C even though the systems are supposed to be redundant.

A/B/C failure from one power provider causing a cascade failure.

Generator failures as mentioned here in the first article.

Storms.

The moral of this story is: Back Up. Back Up. Back the eff up.

PhlipElder

@PhlipElder said in AWS Catastrophic Data Loss:

@dafyre said in AWS Catastrophic Data Loss:

@Pete-S said in AWS Catastrophic Data Loss:

Message From Amazon AWS :

Update August 28, 2019 JST:

That is how a post-mortem write up should look. It's got details, and they know within reasonable doubt what actually happened...

It reads like Lemony Snicket's Series of Unfortunate Events, though, lol.

It's amazing. A data centre touted as highly available, cloud only according to some marketing folks, has so many different single points of failure that can bring things down.

I can't count the number of times HVAC "redundant" systems have been the source, or blamed, for system wide outages or outright hardware failures.

Oh, and ATS (Automatic Transfer Switch) systems blowing out A/B/C even though the systems are supposed to be redundant.

A/B/C failure from one power provider causing a cascade failure.

Generator failures as mentioned here in the first article.

Storms.

The moral of this story is: Back Up. Back Up. Back the eff up.

Oh, and one more thing: Thinking a distributed system, whether storage or region or whatever, is a "backup" is like saying RAID is a backup. It is not. Period.
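To make the RAID comparison concrete, here is a minimal sketch of why replication on its own does not protect against deletion or encryption. It is a toy model, not how any provider actually works: `ReplicatedStore` and `take_backup` are hypothetical names made up for illustration.

```python
import copy


class ReplicatedStore:
    """Toy store that mirrors every write and delete to all replicas (hypothetical)."""

    def __init__(self, replicas=3):
        self.replicas = [dict() for _ in range(replicas)]

    def put(self, key, value):
        # Every replica receives the write, good or bad.
        for replica in self.replicas:
            replica[key] = value

    def delete(self, key):
        # Deletes are replicated just as faithfully as writes.
        for replica in self.replicas:
            replica.pop(key, None)

    def get(self, key):
        return self.replicas[0].get(key)


def take_backup(store):
    """Point-in-time copy that is detached from later changes."""
    return copy.deepcopy(store.replicas[0])


store = ReplicatedStore()
store.put("invoice.docx", "original contents")
backup = take_backup(store)

# Ransomware or a user mistake: the damage lands on every replica.
store.put("invoice.docx", "ENCRYPTED")
store.delete("invoice.docx")

replicated_copy = store.get("invoice.docx")  # gone from all replicas
restored_copy = backup.get("invoice.docx")   # still in the detached backup
```

The replicas stay perfectly in sync with each other, which is exactly the problem: only the detached point-in-time copy still holds the original.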

dafyre

@Dashrender said in AWS Catastrophic Data Loss:

@dafyre said in AWS Catastrophic Data Loss:

@Pete-S said in AWS Catastrophic Data Loss:

@dafyre said in AWS Catastrophic Data Loss:

@Pete-S said in AWS Catastrophic Data Loss:

Message From Amazon AWS :

Update August 28, 2019 JST:

That is how a post-mortem write up should look. It's got details, and they know within reasonable doubt what actually happened...

It reads like Lemony Snicket's Series of Unfortunate Events, though, lol.

Yes, it does. I'm familiar with systems of that type and the problems Amazon experienced were "rookie" mistakes. Imagine a chemical plant or even worse, a nuclear plant that made design mistakes like that. Outcome would have been a little worse than some ones and zeros getting lost.

It sure seems like rookie mistakes, doesn't it. Their system is complex, and they faced problems along every step of the way. It seems like the biggest mistake here is not actually testing procedures.

Eh? It seemed like they did test; they just never had a failure like this in the past. Not saying there isn't room for improvement.

The blurb that @Pete-S shared doesn't say much about their testing procedures, so you could well be right. I've definitely seen cascading failures like this before, where it just seems like all of the 'safeties' that you had in place fail one after another until things finally shut down.

Dashrender

@BRRABill said in AWS Catastrophic Data Loss:

@Dashrender said in AWS Catastrophic Data Loss:

@BRRABill said in AWS Catastrophic Data Loss:

@Dashrender said in AWS Catastrophic Data Loss:

as I understand it - MS has backups, but they are only used by MS when they have a major issue and they need to restore that - like a lost DC or something.

Yes, I believe this kind of issue is the exact thing he would say O365 backups would "cover". Of course if your user deleted something or you got hacked, no backups there for that kind of thing.

Be interesting to hear his take (@scottalanmiller) at some point. Because it is ultimately what I always argued about. I don't care WHO it is ... I don't trust them. Especially if the data is critical/important.

It's not MS deleting/losing data that most people are worried about, because that happens SO rarely. It's cryptoware, users, etc. deleting data that is a much bigger threat.

So frankly how a cloud backup solution isn't just part of the deal is something I don't understand - of course others (not Scott) in this thread are saying - OF COURSE it is...

No they aren't.

They are saying that if the service (aka O365 or AWS) has an issue with THEIR system (which happens to hold your data) then THEIR system is backed up.

But it sounds like in this situation even that failed.

Too many bloody THEIRs in there...

I think you are saying that MS has its own backups for cases where MS's DC blows the hell up. Then they can restore that data to the last backup point.

What I'm saying is that that is USELESS to end users/corporate customers... because the chances that MS's DC is going to blow up is extremely small compared to other types of data loss, like Cryptoware, users, etc. In those cases, MS will say - tough tits.. we don't provide backups for your data, that is on you.

PhlipElder

@dafyre said in AWS Catastrophic Data Loss:

@Dashrender said in AWS Catastrophic Data Loss:

@dafyre said in AWS Catastrophic Data Loss:

@Pete-S said in AWS Catastrophic Data Loss:

@dafyre said in AWS Catastrophic Data Loss:

@Pete-S said in AWS Catastrophic Data Loss:

Message From Amazon AWS :

Update August 28, 2019 JST:

That is how a post-mortem write up should look. It's got details, and they know within reasonable doubt what actually happened...

It reads like Lemony Snicket's Series of Unfortunate Events, though, lol.

Yes, it does. I'm familiar with systems of that type and the problems Amazon experienced were "rookie" mistakes. Imagine a chemical plant or even worse, a nuclear plant that made design mistakes like that. Outcome would have been a little worse than some ones and zeros getting lost.

It sure seems like rookie mistakes, doesn't it. Their system is complex, and they faced problems along every step of the way. It seems like the biggest mistake here is not actually testing procedures.

Eh? It seemed like they did test; they just never had a failure like this in the past. Not saying there isn't room for improvement.

The blurb that @Pete-S shared doesn't say much about their testing procedures, so you could well be right. I've definitely seen cascading failures like this before, where it just seems like all of the 'safeties' that you had in place fail one after another until things finally shut down.

That's the inherent problem with hyper-scale systems. There is no way to fully test resilience. None. Nada. Zippo. Zilch.

It's all fly by the seat of the pants theory until the sh#t statement above happens.

1337

@PhlipElder said in AWS Catastrophic Data Loss:

@dafyre said in AWS Catastrophic Data Loss:

@Dashrender said in AWS Catastrophic Data Loss:

@dafyre said in AWS Catastrophic Data Loss:

@Pete-S said in AWS Catastrophic Data Loss:

@dafyre said in AWS Catastrophic Data Loss:

@Pete-S said in AWS Catastrophic Data Loss:

Message From Amazon AWS :

Update August 28, 2019 JST:

That is how a post-mortem write up should look. It's got details, and they know within reasonable doubt what actually happened...

It reads like Lemony Snicket's Series of Unfortunate Events, though, lol.

Yes, it does. I'm familiar with systems of that type and the problems Amazon experienced were "rookie" mistakes. Imagine a chemical plant or even worse, a nuclear plant that made design mistakes like that. Outcome would have been a little worse than some ones and zeros getting lost.

It sure seems like rookie mistakes, doesn't it. Their system is complex, and they faced problems along every step of the way. It seems like the biggest mistake here is not actually testing procedures.

Eh? It seemed like they did test; they just never had a failure like this in the past. Not saying there isn't room for improvement.

The blurb that @Pete-S shared doesn't say much about their testing procedures, so you could well be right. I've definitely seen cascading failures like this before, where it just seems like all of the 'safeties' that you had in place fail one after another until things finally shut down.

That's the inherent problem with hyper-scale systems. There is no way to fully test resilience. None. Nada. Zippo. Zilch.

It's all fly by the seat of the pants theory until the sh#t statement above happens.

I would argue and say that making something fail-safe is not the problem. It's most likely that they didn't think it was important enough to invest enough time and money in their hyperscale datacenter to make sure it wouldn't fail on nonsense like this. After all the goal is to make money, not spend more than needed.

The technology and knowledge exist because they're used in other industries where failures will result in death and catastrophe.

dafyre

@Pete-S said in AWS Catastrophic Data Loss:

@PhlipElder said in AWS Catastrophic Data Loss:

@dafyre said in AWS Catastrophic Data Loss:

@Dashrender said in AWS Catastrophic Data Loss:

@dafyre said in AWS Catastrophic Data Loss:

@Pete-S said in AWS Catastrophic Data Loss:

@dafyre said in AWS Catastrophic Data Loss:

@Pete-S said in AWS Catastrophic Data Loss:

Message From Amazon AWS :

Update August 28, 2019 JST:

That is how a post-mortem write up should look. It's got details, and they know within reasonable doubt what actually happened...

It reads like Lemony Snicket's Series of Unfortunate Events, though, lol.

Yes, it does. I'm familiar with systems of that type and the problems Amazon experienced were "rookie" mistakes. Imagine a chemical plant or even worse, a nuclear plant that made design mistakes like that. Outcome would have been a little worse than some ones and zeros getting lost.

It sure seems like rookie mistakes, doesn't it. Their system is complex, and they faced problems along every step of the way. It seems like the biggest mistake here is not actually testing procedures.

Eh? It seemed like they did test; they just never had a failure like this in the past. Not saying there isn't room for improvement.

The blurb that @Pete-S shared doesn't say much about their testing procedures, so you could well be right. I've definitely seen cascading failures like this before, where it just seems like all of the 'safeties' that you had in place fail one after another until things finally shut down.

That's the inherent problem with hyper-scale systems. There is no way to fully test resilience. None. Nada. Zippo. Zilch.

It's all fly by the seat of the pants theory until the sh#t statement above happens.

I would argue and say that making something fail-safe is not the problem. It's most likely that they didn't think it was important enough to invest enough time and money in their hyperscale datacenter to make sure it wouldn't fail on nonsense like this. After all the goal is to make money, not spend more than needed.

The technology and knowledge exist because they're used in other industries where failures will result in death and catastrophe.

I agree with you here somewhat. There is no such thing as "won't fail." There's always a chance of failure. The more money you can throw at it, the less likely a cascade of failures is to happen.

Having worked in IT a long time, I can count on one hand the number of times I've seen PLCs fail in similar situations. But they are electronics and they're not going to be 100% reliable when there are voltage spikes and power brownouts.

Their goal is definitely to make money, but they also have to spend enough to protect their reputation when stuff like this does happen (and it will).

dafyre

@PhlipElder said in AWS Catastrophic Data Loss:

@dafyre said in AWS Catastrophic Data Loss:

@Dashrender said in AWS Catastrophic Data Loss:

@dafyre said in AWS Catastrophic Data Loss:

@Pete-S said in AWS Catastrophic Data Loss:

@dafyre said in AWS Catastrophic Data Loss:

@Pete-S said in AWS Catastrophic Data Loss:

Message From Amazon AWS :

Update August 28, 2019 JST:

That is how a post-mortem write up should look. It's got details, and they know within reasonable doubt what actually happened...

It reads like Lemony Snicket's Series of Unfortunate Events, though, lol.

Yes, it does. I'm familiar with systems of that type and the problems Amazon experienced were "rookie" mistakes. Imagine a chemical plant or even worse, a nuclear plant that made design mistakes like that. Outcome would have been a little worse than some ones and zeros getting lost.

It sure seems like rookie mistakes, doesn't it. Their system is complex, and they faced problems along every step of the way. It seems like the biggest mistake here is not actually testing procedures.

Eh? It seemed like they did test; they just never had a failure like this in the past. Not saying there isn't room for improvement.

The blurb that @Pete-S shared doesn't say much about their testing procedures, so you could well be right. I've definitely seen cascading failures like this before, where it just seems like all of the 'safeties' that you had in place fail one after another until things finally shut down.

That's the inherent problem with hyper-scale systems. There is no way to fully test resilience. None. Nada. Zippo. Zilch.

It's all fly by the seat of the pants theory until the sh#t statement above happens.

There is a way to fully test resiliency at hyper-scale. But to @Pete-S's comment about money, there's a cost to doing that kind of planning, testing, improving, and retesting. A company whose goal is purely to make as much money as humanly possible is probably not going to put enough money into a FULL resiliency test, but it can be done...

BRRABill

@Dashrender said in AWS Catastrophic Data Loss:

What I'm saying is that that is USELESS to end users/corporate customers...

I've been arguing that for years.

because the chances that MS's DC is going to blow up is extremely small

And yet, it is what this thread is about ... exactly that happening.

Dashrender

@BRRABill said in AWS Catastrophic Data Loss:

because the chances that MS's DC is going to blow up is extremely small

And yet, it is what this thread is about ... exactly that happening.

Except that it's Amazon, not MS.

dafyre

@Dashrender said in AWS Catastrophic Data Loss:

@BRRABill said in AWS Catastrophic Data Loss:

because the chances that MS's DC is going to blow up is extremely small

And yet, it is what this thread is about ... exactly that happening.

Except that it's Amazon, not MS.

Same difference, though. Think about the number of issues that some folks have with Microsoft.

DustinB3403

@dafyre said in AWS Catastrophic Data Loss:

@Dashrender said in AWS Catastrophic Data Loss:

@BRRABill said in AWS Catastrophic Data Loss:

because the chances that MS's DC is going to blow up is extremely small

And yet, it is what this thread is about ... exactly that happening.

Except that it's Amazon, not MS.

Same difference, though. Think about the number of issues that some folks have with Microsoft.

Paging @scottalanmiller !!!

PhlipElder

@Dashrender said in AWS Catastrophic Data Loss:

@BRRABill said in AWS Catastrophic Data Loss:

because the chances that MS's DC is going to blow up is extremely small

And yet, it is what this thread is about ... exactly that happening.

Except that it's Amazon, not MS.

MS was US Central this year or late last.

MS was the world when their authentication mechanism went down; I think it was a year or so ago.

MS was Europe offline with VMs hosed and a recovery needed. Weeks.

MS has had plenty of trials by fire.

Not one of the hyper-scale folks is trouble-free.

Most of our clients have had 100% up-time across solution sets for years and in some cases we're coming up on decades. Cloud can't touch that. Period.