AWS Catastrophic Data Loss
-
@FATeknollogee said in AWS Catastrophic Data Loss:
@PhlipElder said in AWS Catastrophic Data Loss:
@IRJ said in AWS Catastrophic Data Loss:
For IaaS, using a tool like
terraform
can help you transition from one platform to another asterraform
is compatible with many cloud hosts.I feel like I'm back in the early 2000s when Microsoft released Small Business Server 2000 then Small Business Server 2003 with the business owner DIY message. We got a lot of calls as a result of that messaging over the years.
Then, there was the mess created by the "IT Consultant" that didn't know their butt from a hole in the ground. We cleaned up a lot of those over the years.
At least in the above cases we could work with some sort of box to get their data on a roll.
Today, that possibility is virtually nil.
That is, the business owner being knowledgeable enough to navigate the spaghetti of cloud services setup to get to a point where they are secure and backed up for one. For another, as mentioned above, how many folks know how to set up any cloud?
Then, toss into the mix the message about speed and agility and we have a deadly mix beyond the SBS messaging and failures in that we're talking orders of magnitude more folks losing their businesses as a result of one big FUBAR.
Ever been on the back of a bike holding a case of beer while the "driver" hit 200+ KPH? I have. Once. And lived to never, ever, ever, trust an arse like that again.
The powerful power of marketing
Cloud = :couple_with_heart: :kissing_face_with_smiling_eyes: :kissing_cat_face_with_closed_eyes:Your understanding of cloud is quite exquisite
-
@IRJ said in AWS Catastrophic Data Loss:
@PhlipElder said in AWS Catastrophic Data Loss:
For another, as mentioned above, how many folks know how to set up any cloud?
That's why the cloud guys make the :money_bag: :money_bag: :money_bag: :money_bag:
Cloud isn't even that difficult. Most could learn the basic concepts in an hour or two.
Like I said, our tech bubble.
-
@PhlipElder said in AWS Catastrophic Data Loss:
@Dashrender said in AWS Catastrophic Data Loss:
@PhlipElder said in AWS Catastrophic Data Loss:
@wrx7m said in AWS Catastrophic Data Loss:
This was one AZ, right? If so, you need to design your environment to span multiple AZs, if not regions. This is beginner AWS design theory.
A few things come to mind:
1: Just how many folks know how to architect a highly available solution in any cloud?
2: At what cost over and above the indicated method does the HA setup incur?
3: It does not matter where the data is, it should be backed up.Microsoft's central US DC failure, I think it was last year or early this year, cause a substantial amount of data loss as well. Not sure if any HA setup could have saved them from what I recall.
How many people backup their O365 systems? I am willing to bet VERY few!! yet, if MS were to have the same issue, customers would find themselves in a similar situation.
One (invalid) claim I see from time to time when migrating to the cloud - it saves money because backups are part of the solution... which we can see here is definitely not the case.
Veeam was one of the first ones on the block to back up O365. That's messaging that Microsoft has not made clear but I've seen in the grapevine as far as the customer being responsible to do so.
No. My sh#t on their sh#t means no sh#t if something takes a sh#t.
This is a discussion that @scottalanmiller has been asked about before and I could have sworn that for the most part - he was against the need to specifically backup O365.
-
@Dashrender said in AWS Catastrophic Data Loss:
@PhlipElder said in AWS Catastrophic Data Loss:
@Dashrender said in AWS Catastrophic Data Loss:
@PhlipElder said in AWS Catastrophic Data Loss:
@wrx7m said in AWS Catastrophic Data Loss:
This was one AZ, right? If so, you need to design your environment to span multiple AZs, if not regions. This is beginner AWS design theory.
A few things come to mind:
1: Just how many folks know how to architect a highly available solution in any cloud?
2: At what cost over and above the indicated method does the HA setup incur?
3: It does not matter where the data is, it should be backed up.Microsoft's central US DC failure, I think it was last year or early this year, cause a substantial amount of data loss as well. Not sure if any HA setup could have saved them from what I recall.
How many people backup their O365 systems? I am willing to bet VERY few!! yet, if MS were to have the same issue, customers would find themselves in a similar situation.
One (invalid) claim I see from time to time when migrating to the cloud - it saves money because backups are part of the solution... which we can see here is definitely not the case.
Veeam was one of the first ones on the block to back up O365. That's messaging that Microsoft has not made clear but I've seen in the grapevine as far as the customer being responsible to do so.
No. My sh#t on their sh#t means no sh#t if something takes a sh#t.
This is a discussion that @scottalanmiller has been asked about before and I could have sworn that for the most part - he was against the need to specifically backup O365.
The prototype of the current model, IMO, was explained by the Exchange Team at the first Microsoft Exchange Conference that I attended back when.
They dogfooded 400 mailboxes using a distributed model with Exchange 2013 non-production bits. No backups. There were a few surprised looks in the room when that was announced.
The G00g demonstrated quite clearly what happens when the distributed model fails some years ago by losing mailboxes (a client of ours was affected by that - lost their entire business continuity).
I wish I could say I could count on one hand the number of times I've dealt with a, "Help! My server crashed and we don't have any backups". But not.
Oh, and Maersk nearly effed themselves if it weren't for an offline DC in Africa. I think it was Ghana thinking their distributed domain would withstand anything. Until they were encrypted with no DC backups. Anywhere. SMH
-
as I understand it - MS has backups, but they are only used by MS when they have a major issue and they need to restore that - like a lost DC or something.
-
Update August 28, 2019 JST:
As we mentioned in our initial summary, this event impacted a small portion of a single Availability Zone (āAZā) in our Tokyo Region. The impact was to the Amazon EC2 and Amazon EBS resources in that AZ, though some other services (such as RDS, Redshift, ElastiCache, and Workspaces) would have seen some impact in that AZ if their underlying EC2 instances were affected. As we have further investigated this event with our customers, we have discovered a few isolated cases where customers' applications running across multiple Availability Zones saw unexpected impact (i.e. some customers using Application Load Balancer in combination with AWS Web Application Firewall or sticky sessions, saw a higher than expected percent of requests return an Internal Server Error). We are sharing additional details on these isolated issues directly with impacted customers.
Summary of the Amazon EC2 and Amazon EBS Service Event in the Tokyo (AP-NORTHEAST-1) RegionWeād like to give you some additional information about the service disruption that occurred in the Tokyo (AP-NORTHEAST-1) Region on August 23, 2019. Beginning at 12:36 PM JST, a small percentage of EC2 servers in a single Availability Zone in the Tokyo (AP-NORTHEAST-1) Region shut down due to overheating. This resulted in impaired EC2 instances and degraded EBS volume performance for some resources in the affected area of the Availability Zone. The overheating was due to a control system failure that caused multiple, redundant cooling systems to fail in parts of the affected Availability Zone. The affected cooling systems were restored at 3:21 PM JST and temperatures in the affected areas began to return to normal. As temperatures returned to normal, power was restored to the affected instances. By 6:30 PM JST, the vast majority of affected instances and volumes had recovered. A small number of instances and volumes were hosted on hardware which was adversely affected by the loss of power and excessive heat. It took longer to recover these instances and volumes and some needed to be retired as a result of failures to the underlying hardware.
In addition to the impact to affected instances and EBS volumes, there was some impact to the EC2 RunInstances API. At 1:21 PM JST, attempts to launch new EC2 instances targeting the impacted Availability Zone and attempts to use the āidempotency tokenā (a feature which allows customers to retry run instance commands without risking multiple resulting instance launches) with the RunInstances API in the region began to experience error rates. Other EC2 APIs and launches that did not include an āidempotency token,ā continued to operate normally. This issue also prevented new launches from Auto Scaling which depends on the āidempotency tokenā. At 2:51 PM JST, engineers resolved the issue affecting the āidempotency tokenā and Auto Scaling. Launches of new EC2 instances in the affected Availability Zone continued to fail until 4:05 PM JST, when the EC2 control plane subsystem had been restored in the impacted Availability Zone. Attempts to create new snapshots for affected EBS volumes, also experienced increased error rates during the event.
This event was caused by a failure of our datacenter control system, which is used to control and optimize the various cooling systems used in our datacenters. The control system runs on multiple hosts for high availability. This control system contains third-party code which allows it to communicate with third-party devices such as fans, chillers, and temperature sensors. It communicates either directly or through embedded Programmable Logic Controllers (PLC) which in turn communicate with the actual devices. Just prior to the event, the datacenter control system was in the process of failing away from one of the control hosts. During this kind of failover, the control system has to exchange information with other control systems and the datacenter equipment it controls (e.g., the cooling equipment and temperature sensors throughout the datacenter) to ensure that the new control host has the most up-to-date information about the state of the datacenter. Due to a bug in the third-party control system logic, this exchange resulted in excessive interactions between the control system and the devices in the datacenter which ultimately resulted in the control system becoming unresponsive. Our datacenters are designed such that if the datacenter control system fails, the cooling systems go into maximum cooling mode until the control system functionality is restored. While this worked correctly in most of the datacenter, in a small portion of the datacenter, the cooling system did not correctly transition to this safe cooling configuration and instead shut down. As an added safeguard, our datacenter operators have the ability to bypass the datacenter control systems and put our cooling system in āpurgeā mode to quickly exhaust hot air in the event of a malfunction. The team attempted to activate purge in the affected areas of the datacenter, but this also failed. At this point, temperatures began to rise in the affected part of the datacenter and servers began to power off when they became too hot. Because the datacenter control system was unavailable, the operations team had minimum visibility into the health and state of the datacenter cooling systems. To recover, the team had to manually investigate and reset all of the affected pieces of equipment and put them into a maximum cooling configuration. During this process, it was discovered that the PLCs controlling some of the air handling units were also unresponsive. These controllers needed to be reset. It was the failure of these PLC controllers which prevented the default cooling and āpurgeā mode from correctly working. After these controllers were reset, cooling was restored to the affected area of the datacenter and temperatures began to decrease.
We are still working with our third-party vendors to understand the bug, and subsequent interactions, that caused both the control system and the impacted PLCs to become unresponsive. In the interim, we have disabled the failover mode that triggered this bug on our control systems to ensure we do not have a recurrence of this issue. We have also trained our local operations teams to quickly identify and remediate this situation if it were to recur, and we are confident that we could reset the system before seeing any customer impact if a similar situation was to occur for any reason. Finally, we are working to modify the way that we control the impacted air handling units to ensure that āpurge modeā is able to bypass the PLC controllers completely. This is an approach we have begun using in our newest datacenter designs and will make us even more confident that āpurge modeā will work even if PLCs become unresponsive.
During this event, EC2 instances and EBS volumes in other Availability Zones in the region were not affected. Customers that were running their applications thoroughly across multiple Availability Zones were able to maintain availability throughout the event. For customers that need the highest availability for their applications, we continue to recommend running applications with this multiple Availability Zone architecture; any application component that can create availability issues for customers should run in this fault tolerant way.
We apologize for any inconvenience this event may have caused. We know how critical our services are to our customersā businesses. We are never satisfied with operational performance that is anything less than perfect, and we will do everything we can to learn from this event and drive improvement across our services.
-
@Pete-S said in AWS Catastrophic Data Loss:
Update August 28, 2019 JST:
That is how a post-mortem write up should look. It's got details, and they know within reasonable doubt what actually happened...
It reads like Lemony Snicket's Series of Unfortunate Events, though, lol.
-
@dafyre said in AWS Catastrophic Data Loss:
@Pete-S said in AWS Catastrophic Data Loss:
Update August 28, 2019 JST:
That is how a post-mortem write up should look. It's got details, and they know within reasonable doubt what actually happened...
It reads like Lemony Snicket's Series of Unfortunate Events, though, lol.
Yes, it does. I'm familiar with systems of that type and the problems Amazon experienced were "rookie" mistakes. Imagine a chemical plant or even worse, a nuclear plant that made design mistakes like that. Outcome would have been a little worse than some ones and zeros getting lost.
-
@Pete-S said in AWS Catastrophic Data Loss:
@dafyre said in AWS Catastrophic Data Loss:
@Pete-S said in AWS Catastrophic Data Loss:
Update August 28, 2019 JST:
That is how a post-mortem write up should look. It's got details, and they know within reasonable doubt what actually happened...
It reads like Lemony Snicket's Series of Unfortunate Events, though, lol.
Yes, it does. I'm familiar with systems of that type and the problems Amazon experienced were "rookie" mistakes. Imagine a chemical plant or even worse, a nuclear plant that made design mistakes like that. Outcome would have been a little worse than some ones and zeros getting lost.
It sure seems like rookie mistakes, doesn't it. Their system is complex, and they faced problems along every step of the way. It seems like the biggest mistake here is not actually testing procedures.
-
@dafyre said in AWS Catastrophic Data Loss:
@Pete-S said in AWS Catastrophic Data Loss:
@dafyre said in AWS Catastrophic Data Loss:
@Pete-S said in AWS Catastrophic Data Loss:
Update August 28, 2019 JST:
That is how a post-mortem write up should look. It's got details, and they know within reasonable doubt what actually happened...
It reads like Lemony Snicket's Series of Unfortunate Events, though, lol.
Yes, it does. I'm familiar with systems of that type and the problems Amazon experienced were "rookie" mistakes. Imagine a chemical plant or even worse, a nuclear plant that made design mistakes like that. Outcome would have been a little worse than some ones and zeros getting lost.
It sure seems like rookie mistakes, doesn't it. Their system is complex, and they faced problems along every step of the way. It seems like the biggest mistake here is not actually testing procedures.
eh? It seemed like they did do test.. they just never had a failure like this in the past. Not saying there isn't room for improvement,
-
@Dashrender said in AWS Catastrophic Data Loss:
as I understand it - MS has backups, but they are only used by MS when they have a major issue and they need to restore that - like a lost DC or something.
Yes, I believe this kind of issue is the exact thing he would say O365 backups would "cover". Of course if your user deleted something or you got hacked, no backups there for that kind of thing.
Be interesting to hear his take (@scottalanmiller) at some point. Because it is ultimately what I always argued about. I don't care WHO it is ... I don't trust them. Especially if the data is critical/important.
-
@BRRABill said in AWS Catastrophic Data Loss:
@Dashrender said in AWS Catastrophic Data Loss:
as I understand it - MS has backups, but they are only used by MS when they have a major issue and they need to restore that - like a lost DC or something.
Yes, I believe this kind of issue is the exact thing he would say O365 backups would "cover". Of course if your user deleted something or you got hacked, no backups there for that kind of thing.
Be interesting to hear his take (@scottalanmiller) at some point. Because it is ultimately what I always argued about. I don't care WHO it is ... I don't trust them. Especially if the data is critical/important.
It's not MS deleting/loosing data that most people are worried about - because that happens SO rarely.. .it's cryptoware, users, etc deleting data that is a much bigger threat.
So frankly how a cloud backup solution isn't just part of the deal is something I don't understand - of course others (not Scott) in this thread are saying - OF COURSE it is...
-
@Dashrender said in AWS Catastrophic Data Loss:
@BRRABill said in AWS Catastrophic Data Loss:
@Dashrender said in AWS Catastrophic Data Loss:
as I understand it - MS has backups, but they are only used by MS when they have a major issue and they need to restore that - like a lost DC or something.
Yes, I believe this kind of issue is the exact thing he would say O365 backups would "cover". Of course if your user deleted something or you got hacked, no backups there for that kind of thing.
Be interesting to hear his take (@scottalanmiller) at some point. Because it is ultimately what I always argued about. I don't care WHO it is ... I don't trust them. Especially if the data is critical/important.
It's not MS deleting/loosing data that most people are worried about - because that happens SO rarely.. .it's cryptoware, users, etc deleting data that is a much bigger threat.
So frankly how a cloud backup solution isn't just part of the deal is something I don't understand - of course others (not Scott) in this thread are saying - OF COURSE it is...
No they aren't.
They are saying that if the service (aka O365 or AWS) has an issue with THEIR system (which happens to hold your data) then THEIR system is backed up.
But it sounds like in this situation even that failed.
-
@dafyre said in AWS Catastrophic Data Loss:
@Pete-S said in AWS Catastrophic Data Loss:
Update August 28, 2019 JST:
That is how a post-mortem write up should look. It's got details, and they know within reasonable doubt what actually happened...
It reads like Lemony Snicket's Series of Unfortunate Events, though, lol.
It's amazing. A data centre touted as highly available, cloud only according to some marketing folks, has so many different single points of failure that can bring things down.
I can't count the number of times HVAC "redundant" systems have been the source, or blamed, for system wide outages or outright hardware failures.
Oh, and ATS (Automatic Transfer Switch) systems blowing out A/B/C even though the systems are supposed to be redundant.
A/B/C failure from one power provider causing a cascade failure.
Generator failures as mentioned here in the first article.
Storms.
The moral of this story is: Back Up. Back Up. Back the eff up.
-
@PhlipElder said in AWS Catastrophic Data Loss:
@dafyre said in AWS Catastrophic Data Loss:
@Pete-S said in AWS Catastrophic Data Loss:
Update August 28, 2019 JST:
That is how a post-mortem write up should look. It's got details, and they know within reasonable doubt what actually happened...
It reads like Lemony Snicket's Series of Unfortunate Events, though, lol.
It's amazing. A data centre touted as highly available, cloud only according to some marketing folks, has so many different single points of failure that can bring things down.
I can't count the number of times HVAC "redundant" systems have been the source, or blamed, for system wide outages or outright hardware failures.
Oh, and ATS (Automatic Transfer Switch) systems blowing out A/B/C even though the systems are supposed to be redundant.
A/B/C failure from one power provider causing a cascade failure.
Generator failures as mentioned here in the first article.
Storms.
The moral of this story is: Back Up. Back Up. Back the eff up.
Oh, and one more thing: Thinking a distributed system, whether storage or region or whatever, is a "backup" is like saying RAID is a backup. It is not. Period.
-
@Dashrender said in AWS Catastrophic Data Loss:
@dafyre said in AWS Catastrophic Data Loss:
@Pete-S said in AWS Catastrophic Data Loss:
@dafyre said in AWS Catastrophic Data Loss:
@Pete-S said in AWS Catastrophic Data Loss:
Update August 28, 2019 JST:
That is how a post-mortem write up should look. It's got details, and they know within reasonable doubt what actually happened...
It reads like Lemony Snicket's Series of Unfortunate Events, though, lol.
Yes, it does. I'm familiar with systems of that type and the problems Amazon experienced were "rookie" mistakes. Imagine a chemical plant or even worse, a nuclear plant that made design mistakes like that. Outcome would have been a little worse than some ones and zeros getting lost.
It sure seems like rookie mistakes, doesn't it. Their system is complex, and they faced problems along every step of the way. It seems like the biggest mistake here is not actually testing procedures.
eh? It seemed like they did do test.. they just never had a failure like this in the past. Not saying there isn't room for improvement,
The blurb that @Pete-S shared doesn't say much about their testing procedures, so you could well be right. I've definitely seem cascading failures like this before where it just seems like all of the 'safeties' that you had in place would fail until things finally shut down.
-
@BRRABill said in AWS Catastrophic Data Loss:
@Dashrender said in AWS Catastrophic Data Loss:
@BRRABill said in AWS Catastrophic Data Loss:
@Dashrender said in AWS Catastrophic Data Loss:
as I understand it - MS has backups, but they are only used by MS when they have a major issue and they need to restore that - like a lost DC or something.
Yes, I believe this kind of issue is the exact thing he would say O365 backups would "cover". Of course if your user deleted something or you got hacked, no backups there for that kind of thing.
Be interesting to hear his take (@scottalanmiller) at some point. Because it is ultimately what I always argued about. I don't care WHO it is ... I don't trust them. Especially if the data is critical/important.
It's not MS deleting/loosing data that most people are worried about - because that happens SO rarely.. .it's cryptoware, users, etc deleting data that is a much bigger threat.
So frankly how a cloud backup solution isn't just part of the deal is something I don't understand - of course others (not Scott) in this thread are saying - OF COURSE it is...
No they aren't.
They are saying that if the service (aka O365 or AWS) has an issue with THEIR system (which happens to hold your data) then THEIR system is backed up.
But it sounds like in this situation even that failed.
to many blood their's in there...
I think you are saying that MS has it's own backups for cases where MS's DC blows the hell up. Then they can restore that data - to the last backup point.
What I'm saying is that that is USELESS to end users/corporate customers... because the chances that MS's DC is going to blow up is extremely small compared to other types of data loss, like Cryptoware, users, etc. In those cases, MS will say - tough tits.. we don't provide backups for your data, that is on you.
-
@dafyre said in AWS Catastrophic Data Loss:
@Dashrender said in AWS Catastrophic Data Loss:
@dafyre said in AWS Catastrophic Data Loss:
@Pete-S said in AWS Catastrophic Data Loss:
@dafyre said in AWS Catastrophic Data Loss:
@Pete-S said in AWS Catastrophic Data Loss:
Update August 28, 2019 JST:
That is how a post-mortem write up should look. It's got details, and they know within reasonable doubt what actually happened...
It reads like Lemony Snicket's Series of Unfortunate Events, though, lol.
Yes, it does. I'm familiar with systems of that type and the problems Amazon experienced were "rookie" mistakes. Imagine a chemical plant or even worse, a nuclear plant that made design mistakes like that. Outcome would have been a little worse than some ones and zeros getting lost.
It sure seems like rookie mistakes, doesn't it. Their system is complex, and they faced problems along every step of the way. It seems like the biggest mistake here is not actually testing procedures.
eh? It seemed like they did do test.. they just never had a failure like this in the past. Not saying there isn't room for improvement,
The blurb that @Pete-S shared doesn't say much about their testing procedures, so you could well be right. I've definitely seem cascading failures like this before where it just seems like all of the 'safeties' that you had in place would fail until things finally shut down.
That's the inherent problem with hyper-scale systems. There is no way to fully test resilience. None. Nada. Zippo. Zilch.
It's all fly by the seat of the pants theory until the sh#t statement above happens.
-
@PhlipElder said in AWS Catastrophic Data Loss:
@dafyre said in AWS Catastrophic Data Loss:
@Dashrender said in AWS Catastrophic Data Loss:
@dafyre said in AWS Catastrophic Data Loss:
@Pete-S said in AWS Catastrophic Data Loss:
@dafyre said in AWS Catastrophic Data Loss:
@Pete-S said in AWS Catastrophic Data Loss:
Update August 28, 2019 JST:
That is how a post-mortem write up should look. It's got details, and they know within reasonable doubt what actually happened...
It reads like Lemony Snicket's Series of Unfortunate Events, though, lol.
Yes, it does. I'm familiar with systems of that type and the problems Amazon experienced were "rookie" mistakes. Imagine a chemical plant or even worse, a nuclear plant that made design mistakes like that. Outcome would have been a little worse than some ones and zeros getting lost.
It sure seems like rookie mistakes, doesn't it. Their system is complex, and they faced problems along every step of the way. It seems like the biggest mistake here is not actually testing procedures.
eh? It seemed like they did do test.. they just never had a failure like this in the past. Not saying there isn't room for improvement,
The blurb that @Pete-S shared doesn't say much about their testing procedures, so you could well be right. I've definitely seem cascading failures like this before where it just seems like all of the 'safeties' that you had in place would fail until things finally shut down.
That's the inherent problem with hyper-scale systems. There is no way to fully test resilience. None. Nada. Zippo. Zilch.
It's all fly by the seat of the pants theory until the sh#t statement above happens.
I would argue and say that making something fail-safe is not the problem. It's most likely that they didn't think it was important enough to invest enough time and money in their hyperscale datacenter to make sure it wouldn't fail on nonsense like this. After all the goal is to make money, not spend more than needed.
The technology and knowledge exists because it's used in other industries were failures will result in death and catastrophe.
-
@Pete-S said in AWS Catastrophic Data Loss:
@PhlipElder said in AWS Catastrophic Data Loss:
@dafyre said in AWS Catastrophic Data Loss:
@Dashrender said in AWS Catastrophic Data Loss:
@dafyre said in AWS Catastrophic Data Loss:
@Pete-S said in AWS Catastrophic Data Loss:
@dafyre said in AWS Catastrophic Data Loss:
@Pete-S said in AWS Catastrophic Data Loss:
Update August 28, 2019 JST:
That is how a post-mortem write up should look. It's got details, and they know within reasonable doubt what actually happened...
It reads like Lemony Snicket's Series of Unfortunate Events, though, lol.
Yes, it does. I'm familiar with systems of that type and the problems Amazon experienced were "rookie" mistakes. Imagine a chemical plant or even worse, a nuclear plant that made design mistakes like that. Outcome would have been a little worse than some ones and zeros getting lost.
It sure seems like rookie mistakes, doesn't it. Their system is complex, and they faced problems along every step of the way. It seems like the biggest mistake here is not actually testing procedures.
eh? It seemed like they did do test.. they just never had a failure like this in the past. Not saying there isn't room for improvement,
The blurb that @Pete-S shared doesn't say much about their testing procedures, so you could well be right. I've definitely seem cascading failures like this before where it just seems like all of the 'safeties' that you had in place would fail until things finally shut down.
That's the inherent problem with hyper-scale systems. There is no way to fully test resilience. None. Nada. Zippo. Zilch.
It's all fly by the seat of the pants theory until the sh#t statement above happens.
I would argue and say that making something fail-safe is not the problem. It's most likely that they didn't think it was important enough to invest enough time and money in their hyperscale datacenter to make sure it wouldn't fail on nonsense like this. After all the goal is to make money, not spend more than needed.
The technology and knowledge exists because it's used in other industries were failures will result in death and catastrophe.
I agree with you here somewhat. There is no such thing as "won't fail." There's always a chance of failure. The more money you can throw at it, the less likely a cascade of failures is to happen.
Having worked in IT a long time, I can count on one hand the number of times I've sen PLCs fail in similar situations. But they are electronics and they're not going to be 100% reliable when there are voltage spikes and power brown outs.
Their goal is definitely to make money, but they also have to spend enough to protect their reputation when stuff like this does happen (and it will).