The long tail of AWS outages

A sprawling Amazon Web Services cloud outage that began early Monday morning highlighted the fragile interconnectedness of the internet as major telecommunications, financial, health care, education, and government platforms around the world suffered disruptions. As the day wore on, AWS diagnosed the issue, which originated in the company's critical US-EAST-1 region based in Northern Virginia, and began working to correct it. But the chain of effects took time to fully resolve.

Researchers reflecting on the incident particularly highlighted the length of Monday's outage, which began around 3 am ET on Monday, October 20. AWS said in status updates that as of 6:01 pm ET on Monday, “all AWS services had returned to normal operations.” The outage arose directly from Amazon's DynamoDB APIs and, according to the company, affected 141 other AWS services. Several network engineers and infrastructure specialists told WIRED that errors are understandable and inevitable for so-called “hyperscalers” like AWS, Microsoft Azure, and Google Cloud Platform, given their complexity and sheer scale. But they also point out that this reality should not simply excuse cloud providers for prolonged downtime.

“Hindsight is key,” says Ira Winkler, chief information security officer at reliability and cybersecurity firm CYE. “It's easy to spot what went wrong after the fact, but the overall reliability of AWS shows how difficult it is to prevent every failure. Ideally, this would be a lesson learned, and Amazon would implement further redundancies that would prevent a disaster like this from happening in the future, or at least prevent them from remaining down for as long as they did.”

AWS did not respond to WIRED's questions about the lengthy recovery for customers. An AWS spokesperson says the company plans to publish one of its “post-event summaries” about the incident.

“I don't think this was just an outage. I would have expected a full fix much faster,” says Jake Williams, vice president of research and development at Hunter Strategy. “Giving them their due, cascading failures are not something they get a lot of experience working with, because they don't experience outages very often. So that's to their credit. But it's really easy to get into the mindset of giving these companies a pass, and we shouldn't forget that they're creating this situation by actively trying to attract more customers to their infrastructure. Customers don't control whether they're overextending themselves or what might be going on financially.”

The incident was caused by a familiar culprit in web outages: Domain Name System resolution issues. DNS is essentially the internet's phone book, the mechanism that directs web browsers and other software to the correct servers. DNS issues are a common source of outages because, when a name cannot be resolved, requests fail before they ever reach a server and content cannot load.
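To make the failure mode concrete, here is a minimal Python sketch of what a client sees when name resolution fails. It is an illustration only, not a description of how AWS or any affected service works internally. The helper names `resolve` and `fetch` are hypothetical, and the `lookup` parameter exists solely so the lookup step can be swapped out when simulating a failure.

```python
import socket


def resolve(hostname, port=443, lookup=socket.getaddrinfo):
    """Return the sorted unique IP addresses for hostname, or [] if the lookup fails."""
    try:
        # Each result is a 5-tuple: (family, type, proto, canonname, sockaddr).
        results = lookup(hostname, port, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        # The name could not be resolved, so there is no address to connect to.
        return []
    # The first element of sockaddr is the IP address for both IPv4 and IPv6.
    return sorted({sockaddr[0] for *_, sockaddr in results})


def fetch(hostname, lookup=socket.getaddrinfo):
    """Hypothetical client step: it cannot even attempt a connection without an address."""
    addresses = resolve(hostname, lookup=lookup)
    if not addresses:
        return f"cannot reach {hostname}: name did not resolve"
    return f"would connect to {hostname} at {addresses[0]}"
```

Calling `fetch("example.com")` on a machine with working DNS reports an address to connect to. When resolution fails, the function gives up before any request is sent, which is why a DNS problem can make an otherwise healthy server appear to be down.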
