Why and How Amazon Web Services’ US-East-1 Failed • The Register


Amazon has released more information about last week’s US-East-1 outage, revealing that its staff had to trawl through log files because the web giant’s own monitoring tools were affected.

Amazon doesn’t seem to want to reveal a lot of technical details about its internal systems. It’s somewhat understandable; most likely, a few experts would be horrified, a few more would scour it for clues for a future attack, and the rest of the world wouldn’t understand or care. In any case, this could put off some customers, current or potential.



There is an internal AWS network, which hosts unspecified internal services that are used to create and manage unspecified internal AWS resources. Some other internal services are hosted on the main AWS network. Amazon doesn’t tell the world much about this internal network, but it has multiple connections to the outside world, and the cloud goliath “scale[s] the capacity of that network significantly” to ensure its high availability. And this is the process that went wrong.

An autoscaling tool kicked in to scale one of the internal services, one running on the main AWS network, and it went wrong, triggering “a sharp increase in connection activity”.

Basically, this flooded the internal network, which slowed internal DNS to the point of being unusable and took Amazon’s internal monitoring tools with it. The poor operators were forced to rely on log files to trace the problem. This sounds awfully twentieth century for harassed sysadmins, though it still puts them at least a few centuries ahead of Amazon warehouse workers.

While the report refrains from blaming DNS entirely, it appears that moving the internal DNS to another network, which took about two hours, gave administrators enough breathing room to work out what had gone wrong. The report also points out that only AWS’s internal management network was overloaded to the point of uselessness, not AWS itself.

As we noted last week, us-east-1 is the first and oldest of the 21 AWS Regions, and a side effect is that this is where the AWS Global Console landing page is hosted. As one Reg reader noted: “The AWS console is having issues … This is a major flaw IMHO, where if us-east-1 goes down, then the console landing page disappears.”

It is quite ironic that a service that lets its customers distribute their workloads around the world does not do the same with some of its own core services. Amazon noted in the outage report: “We have also deployed an additional network configuration that protects potentially affected network devices, even in the face of a similar congestion event. These corrective actions give us assurance that we will not see this problem reoccurring.”

The incident also took down the service health dashboard, the support contact center, and the Amazon Connect service it runs for customers.

It’s a reminder that, as security man Brian Krebs’ blog recently put it, “The Internet is held together with Spit & Baling Wire.” ®

