While most of the U.S. was sleeping, Amazon Web Services (AWS) suffered a major disruption at one of its largest locations. If you were sleeping, you probably didn’t even notice. If, however, you were up and trying to use ChatGPT, Snapchat, Reddit, Fortnite, or even Amazon, you definitely noticed.
According to the AWS status updates, the company reported “increased error rates and latencies for multiple AWS Services in the US-EAST-1 Region.” The root cause was later identified as issues with DNS resolution of the DynamoDB API endpoint in that region, and the incident rippled into other AWS services.
I’m not going to pretend that I understand what all of those words mean, but what I do understand is this: the internet is much more fragile than most of us realize.
A ripple across the internet
What started inside a single AWS region quickly became global. Major consumer and enterprise platforms reported outages. Coinbase and other crypto and banking services, for example, said they were affected.
AWS first posted a notification at 3:11 a.m. ET, stating it was engaged in mitigation and investigation. By about 5:27 a.m. ET, the company announced “significant signs of recovery,” though it warned that the backlog of requests to the affected services could mean that it would take time for everything to get back to normal.
For a disruption that lasted only a little over two hours, however, the impact was outsized, both for companies that depend on cloud computing and for Amazon. I’ll explain:
Everything is connected
This outage illustrates a truth many users don’t recognize: the internet is more fragile than it seems. So many services that appear independent run on the same foundational infrastructure. The beauty of cloud computing providers like AWS is that individual companies don’t have to spin up their own infrastructure. Instead, they can just buy it from Amazon.
More importantly, because so many companies are doing just that, the overall expense for those companies is far less than if they tried to do it themselves. That seems like a huge win—until something goes wrong. A single error or failure in one region of a major cloud provider can ripple through to millions of users and thousands of services.
To be clear, Amazon is very good at this. There is a reason so many companies depend on AWS: it’s generally very reliable, with better than 99.99 percent availability.
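To put that availability figure in perspective, here is a rough back-of-the-envelope sketch in Python. It rests on assumptions the article does not state: that 99.99 percent is measured over a 365-day year, and that the gap between the two status updates quoted above (3:11 a.m. and about 5:27 a.m. ET) is a fair stand-in for the length of the disruption. The function names are made up for illustration, and this is not how AWS itself measures availability.

```python
from datetime import datetime

# Assumption: availability is measured over a 365-day year.
MINUTES_PER_YEAR = 365 * 24 * 60


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Minutes of downtime per year permitted at a given availability level."""
    return (1 - availability_percent / 100) * MINUTES_PER_YEAR


def window_minutes(start: str, end: str) -> float:
    """Minutes between two same-day clock times written like '3:11'."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return delta.total_seconds() / 60


if __name__ == "__main__":
    # Yearly downtime implied by 99.99 percent availability.
    budget = allowed_downtime_minutes(99.99)
    # Gap between the first notice and the "signs of recovery" update.
    observed = window_minutes("3:11", "5:27")
    print(f"Yearly downtime at 99.99 percent: {budget:.1f} minutes")
    print(f"Window between the two updates:   {observed:.0f} minutes")
```

Under those assumptions, 99.99 percent availability leaves room for less than an hour of downtime in a year, while the window between the two updates runs 136 minutes. That is only an illustration of scale, not a claim about any service-level commitment.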
Which leads to another important point—the internet isn’t the only thing more fragile than we might think. For AWS, nothing is as fragile as trust.
Trust matters most
I’ve written many times that trust is your most important asset. If you want to build a platform that others depend on, they have to believe you’ll be more reliable than if they did it themselves. For most companies, that’s obviously true. Most companies don’t power huge swaths of the internet the way AWS does. It’s a no-brainer.
That’s why Amazon’s response matters so much. Within minutes of identifying an issue, AWS updates its Service Health Dashboard, a public status site that details affected regions and services and explains how the company is working to mitigate effects. Those updates are often timestamped and written in plain, operational language: “We are investigating increased error rates in the US-EAST-1 Region.”
As the incident unfolds, AWS posts incremental updates rather than waiting for a full explanation. The key lesson here is that communication itself is part of the recovery process.
When service stabilizes, AWS issues a “Post-Event Summary,” outlining the technical cause, the scope of impact, and steps taken to make sure it doesn’t happen again. This practice isn’t exclusive to AWS, but it’s definitely unusual in big tech. Many companies prefer to issue vague, after-the-fact statements or none at all.
AWS treats the visibility of its operations as being just as essential as its infrastructure. Amazon’s entire cloud business depends on trust from developers, startups, governments, and Fortune 500s who run their critical business on AWS.
Every update is a signal that Amazon understands how much is at stake and that it’s willing to expose its process to public scrutiny. Transparency won’t erase the frustration of having your online store or streaming service go down, but it does reassure customers that AWS takes reliability seriously enough to narrate its own failures in real time.
Not only that, but the biggest concern when services go down is that it’s some kind of attack. If you’re AWS and you know that’s not the case, you let people know as quickly as you can, even if it means admitting there was a mistake or that something failed.
In the long run, that candor may be what keeps customers from looking elsewhere. If your job is to be the backbone of the internet, trust may be the most fragile thing of all. In the cloud era, what you lose during a failure may not just be access for a few minutes. It might be your customers’ confidence that you still belong in that role.
The opinions expressed here by Inc.com columnists are their own, not those of Inc.com.
Jason Aten