There are very few people who can say that they haven’t heard of or been impacted by the Amazon Web Services (AWS) S3 outage this past Tuesday. When the world’s largest public cloud provider came crashing down for over six hours in the middle of the workweek, the tremors were felt far and wide.
But the “what” of the S3 outage is old news at this point. The more pressing issue is the “how.” How does one explain the most connected, sophisticated, and automated software and services engine in the world going down for hours on end? That question is haunting not only the Azures and Googles of the world but the large private clouds and on-premises enterprise data centers as well. They are asking, “If it could happen to them, why not us?” But maybe I am reaching. Maybe this was a unique AWS occurrence. Either way, they claim to have fixed it, so we can all breathe easy now, right? Wrong!
This is the official explanation from AWS, as published in their post-mortem: “An authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended.”
In other words, instead of removing a handful of servers, they removed far more than they meant to. Couldn’t this happen anywhere: public cloud, private cloud, and on-premises data centers alike? Could it not happen to other infrastructure components, such as virtual machines or containers, as well? The answer to both questions is a resounding yes, and that should scare you.
Going forward, we will be led to believe that the AWS machine will improve by leaps and bounds and will recover from this horrendous blunder, and that is probably true. But every private cloud and enterprise IT data center owner needs to look at this as a wake-up call and put into practice some simple yet meaningful mechanisms that would go a long way toward preventing similar disasters from befalling them. These are scope, control, and a governance model.
Scope calls for a realistic definition of what each administrator is able to do at any given time, and for ensuring that this always stays within the bounds of what is acceptable. In the AWS case, for instance, the scope could have been that an administrator can only work on 10 servers at a time. But what if the job legitimately requires working on more than 10? That’s where control comes in.
Control is the ability to have oversight of critical operations. In the AWS incident, for example, acting on 100 servers all at once would have required approval from a higher authority. In other words, ‘trust but verify.’
And finally, you need a governance model. This is really the implementation of best practices and a well-defined policy for enforcing the above two functions (scope and control) in a self-driven fashion. In this particular example, the policy would be to ensure that the number of servers an admin can operate on remains at or under 10 (scope), and that any increase in that number automatically requires a manager’s approval (control). Further sophistication can be built in: the manager could just as well be a bot that checks the type of workloads and the load on the system and approves (or denies) the request. Bottom line: checks and balances.
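To make the scope-plus-control idea concrete, here is a minimal sketch in Python. It is only an illustration of the policy described above, not how AWS or any real tool implements it. Every name in it (RemovalRequest, authorize, capacity_bot, the limit of 10, and the 5 percent fleet threshold) is a hypothetical stand-in chosen for this example.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# Hypothetical scope: the most servers one admin may act on without sign-off.
SCOPE_LIMIT = 10


@dataclass
class RemovalRequest:
    """A request by one admin to remove a set of servers (illustrative only)."""
    admin: str
    servers: List[str]


class ApprovalRequired(Exception):
    """Raised when a request exceeds scope and nobody has approved it."""


def authorize(
    request: RemovalRequest,
    approver: Optional[Callable[[RemovalRequest], bool]] = None,
    scope_limit: int = SCOPE_LIMIT,
) -> bool:
    """Governance check: allow in-scope work, escalate everything else."""
    # Scope: small requests go through without further checks.
    if len(request.servers) <= scope_limit:
        return True
    # Control: larger requests need an explicit yes from an approver.
    if approver is not None and approver(request):
        return True
    raise ApprovalRequired(
        f"{request.admin} asked to remove {len(request.servers)} servers; "
        f"the limit without approval is {scope_limit}"
    )


def capacity_bot(fleet_size: int, max_fraction: float = 0.05):
    """Build a bot approver that caps any one request at a share of the fleet.

    The 5 percent default is an assumed threshold, not a recommendation.
    """
    def approve(request: RemovalRequest) -> bool:
        return len(request.servers) <= fleet_size * max_fraction
    return approve
```

With this in place, a request for 5 servers passes on its own, a request for 100 servers raises ApprovalRequired unless an approver is supplied, and a bot built with capacity_bot(fleet_size=10000) would approve 100 servers but refuse 1,000. A real approver would likely also look at workload type and current load, as suggested above.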
There are a lot of scaremongers out there talking up the S3 outage as the demise of public cloud and a sign of how unreliable AWS is, but don’t get taken in by that. This can happen anywhere, and to anyone. Besides being glad it didn’t happen to your infrastructure, the biggest takeaway should be to learn from this event, put the right checks in place, and move forward to live another day.