Amazon Corrects Massive AWS S3 Cloud Outage While Vendors React

March 7, 2017
Article Written by David Marshall

Last Tuesday, parts of the Internet came to a grinding halt when the servers that powered them suddenly vanished.  The disappearing act traced back to Amazon Simple Storage Service (S3), Amazon’s popular cloud storage service.

When that incident happened, several big and popular services and Web sites were disrupted, including DraftKings, Gizmodo, IFTTT, Quora, Slack and Trello.

According to the Web site monitoring firm Apica, 54 of the largest online retailers experienced performance impairments on their Web sites, with some slowing down by more than 20 percent; three sites went down completely (Express, Lululemon, One Kings Lane); and for affected sites, the average slowdown was 29.7 seconds to 42.7 seconds in page load time.

What happened?

"At 9:37 a.m. PST, an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process," Amazon said.  "Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended.  The servers that were inadvertently removed supported two other S3 subsystems."

Those subsystems are important.  One of them, the index subsystem, "manages the metadata and location information of all S3 objects in the region," according to Amazon.  Without it, services that depend on it couldn’t perform basic data retrieval and storage tasks.  The second, the placement subsystem, "manages allocation of new storage and requires the index subsystem to be functioning properly to correctly operate"; it is used to allocate storage for new objects.

While S3 was down, a variety of other Amazon Web services stopped functioning, including the S3 console, Amazon Elastic Compute Cloud (EC2) new instance launches, Amazon Elastic Block Store (EBS) volumes and AWS Lambda.

To address the problems, Amazon staff had to restart these subsystems, and during the restart period the subsystems were unable to service requests.  As part of its official response, the company said it would immediately begin making changes to its internal systems to prevent a similar cascading failure from happening again.

Wake-up Call

For some organizations that made the move to the public cloud, the migration may have been treated with a "set it and forget it" mentality.  After all, migrating to a public cloud is supposed to make things disaster proof, right?  Not so fast.

If anything, last week’s event should shine a light on the need to design for failure, whether on premises with a private cloud, all in with a public cloud, or with some type of hybrid or multi-cloud setup.
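To make that concrete, here is a minimal sketch of what "design for failure" can look like at the application level: reading an object from a primary S3 bucket and falling back to a replica in another region when the primary is unavailable.  The bucket names, regions and key are hypothetical placeholders, and a real application would add retries, backoff and alerting.

```python
import boto3
from botocore.exceptions import ClientError, EndpointConnectionError

# Hypothetical primary/replica buckets kept in sync (e.g., via cross-region replication).
PRIMARY = {"bucket": "example-data-us-east-1", "region": "us-east-1"}
REPLICA = {"bucket": "example-data-us-west-2", "region": "us-west-2"}


def fetch_object(key):
    """Try the primary region first, then fall back to the replica."""
    for target in (PRIMARY, REPLICA):
        s3 = boto3.client("s3", region_name=target["region"])
        try:
            response = s3.get_object(Bucket=target["bucket"], Key=key)
            return response["Body"].read()
        except (ClientError, EndpointConnectionError) as err:
            # Log and try the next region instead of failing outright.
            print(f"Read from {target['bucket']} failed: {err}")
    raise RuntimeError(f"Object {key!r} unavailable in all configured regions")
```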

"Outages like what happened to AWS and Amazon S3 are bound to happen," said Manoj Chaudhary, CTO and VP of Engineering at Loggly.  "But during these outages, logs are more important than ever for companies and customers, as they can capture data that would otherwise be lost and pinpoint the root cause of a service interruption."

To help address a problem like last week’s outage, Chaudhary told VMblog that in the end, you want to make sure you’re monitoring your monitoring infrastructure.  

"Diversifying risk by adopting a multi-cloud solution, and hosting monitoring applications in a different environment than the apps they are monitoring, ensures you have the ability to access and search data when it is needed most, even in the time of system outage."

Keeping Cloud Private

For some organizations, the recent Amazon outage could be a call to return data back to on-premises control.

"The turmoil caused by the AWS S3 outage shows just how vital reliable data access is," said Geoff Barrall, Chief Operating Officer, Nexsan.  

He continued, "With so many businesses utilizing a connected workforce, constant access to data is necessary to keep operating.  Any amount of downtime costs businesses time and money and can be more easily managed if data is kept within an organization’s own IT infrastructure.  With sophisticated file, sync and share capabilities, private cloud solutions can offer the flexibility that a connected workforce needs, with the security and control of on premises data storage."

Organizations and users did take to social media and the Internet (those that were still online, anyway) to express similar judgments.  But the question many are now asking is: should this single event cause an organization to abandon a public cloud migration?  In some cases, it will.  In other cases, a planned all-in public cloud migration may go back to the drawing board.  And yes, the outage will keep some organizations from moving to the public cloud at all, staying instead where they believe they can better maintain control of their future, with an on-premises private cloud.  The right answer is, and always has been, whatever is best for your organization.

Public Cloud – Manage Data by Region

It’s OK to put all your data in one public cloud, according to Don Foster, Senior Director of Solutions Marketing and Technical Alliances at Commvault, but you need visibility into where that data lives across regions.  If a region has an outage, your data management platform should give you a clear view of data across multiple regions.

"If your data lives in the East, ensure you have a complete data backup in the West or a region on another continent," said Foster.  "If an outage happens, you can recover quickly in the other region and keep your business running during the service outage."

The important part here is backup.  

Foster explained, "Critical data and services native to the cloud should ensure backups are scheduled in/across/from clouds so your data is available.  Automated backups – and the ability to verify those backups – make your life a lot less stressful."
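Verification can be as simple as periodically comparing what the source and backup buckets actually contain.  The sketch below (with hypothetical bucket names) lists both sides and reports keys that have not reached the backup; for critical data, deeper checks such as object sizes or content checksums may be warranted.

```python
import boto3

s3 = boto3.client("s3")


def list_keys(bucket):
    """Return the set of object keys in a bucket (handles pagination)."""
    keys = set()
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket):
        for obj in page.get("Contents", []):
            keys.add(obj["Key"])
    return keys


source = list_keys("example-data-us-east-1")   # hypothetical source bucket
backup = list_keys("example-data-us-west-2")   # hypothetical backup bucket

missing = source - backup
if missing:
    print(f"{len(missing)} objects have not reached the backup yet:")
    for key in sorted(missing)[:20]:
        print("  ", key)
else:
    print("Backup contains every key from the source bucket.")
```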

Diversifying Cloud

"In the financial markets, investors protect themselves from volatility by diversifying," explained Chuck Dubuque, VP of product and solution marketing, Tintri.  "The same might hold true for companies and organizations that rely on the cloud."

Dubuque added, "The S3 outage demonstrates the risks of putting all your eggs into one cart or cloud.  Moreover, it’s difficult to engineer even cloud native applications for public cloud SLAs as seen by these events.  It’s even more complex to deploy and manage enterprise applications that weren’t designed for the cloud to begin with.  If nothing else, the S3 outages will cause some businesses to reconsider a diversified environment, one that includes enterprise cloud, to reduce their risks."

Disaster Recovery

"For the near foreseeable future, we’re going to hear commentary and see various business impact estimates related to the effects of the S3 outage," said Paul Zeiter, President, Zerto.  "Still, many IT professionals will be wondering what they should be doing differently to protect their organizations for when, not if, something like this happens again."

Zeiter went on to say, "The growing frequency of major headline-creating outages across every industry points to a systemic issue as IT environments become increasingly complex: Disaster recovery is just as essential as cyber security to protect enterprises from the mundane erroneous keystroke or power outages to natural catastrophes, but often under invested in. Business and IT leaders are getting ahead of the curve by carefully crafting their hybrid cloud strategies – one that gives them multiple layers of infrastructure redundancy protection – to achieve IT resilience that keeps critical business operations seamlessly moving forward.  This is possible using a combination of multiple cloud types for recovery including public, private, and managed to ensure any disruption is quickly remediated in a manner that is imperceptible to customers."

Amazon Corrections

As a result of this operational event, Amazon said it is making several changes to the way its systems are managed. 

"While removal of capacity is a key operational practice, in this instance, the tool used allowed too much capacity to be removed too quickly," the company said.

Amazon has already modified the tool that was used to remove the servers.  It has been updated to remove capacity more slowly, and safeguards have been added to prevent capacity from being removed when doing so would take the system below a minimum level of capacity.
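Amazon hasn’t published the tool itself, but the kind of safeguard described is straightforward to picture.  The hypothetical sketch below paces removals in small batches and refuses any request that would take the fleet below a minimum capacity; all names and numbers are illustrative, not Amazon’s.

```python
import time


def remove_capacity(active_servers, requested, min_capacity,
                    batch_size=2, pause_seconds=30):
    """Remove servers slowly, never dropping below min_capacity (illustrative only)."""
    removable = [s for s in requested if s in active_servers]

    # Safeguard: refuse the whole request if it would undercut minimum capacity.
    if len(active_servers) - len(removable) < min_capacity:
        raise ValueError(
            f"Refusing removal: fleet would drop below minimum of {min_capacity} servers"
        )

    # Remove in small batches, pausing between batches so health can be observed.
    for i in range(0, len(removable), batch_size):
        for server in removable[i:i + batch_size]:
            active_servers.remove(server)
            print(f"Removed {server}; {len(active_servers)} servers remain")
        time.sleep(pause_seconds)
```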

Amazon also promised to make changes to improve the recovery time of key S3 subsystems and to audit its other operational tools to ensure they also have similar safety checks.

Finally, Amazon will also make changes to the AWS Service Health Dashboard.  During the outage, the dashboard showed all services with a "green" status because the dashboard itself depended on S3.  To keep false status updates from recurring, Amazon has made changes so that the next time S3 goes down, dashboard status updates should function properly, i.e., affected services will show as down or be marked "red."

We’re Sorry

Beyond the post mortem and system corrections being made, Amazon offered an apology to those who were affected by the outage, stating:

"We want to apologize for the impact this event caused for our customers. While we are proud of our long track record of availability with Amazon S3, we know how critical this service is to our customers, their applications and end users, and their businesses. We will do everything we can to learn from this event and use it to improve our availability even further."

##

About the Author

David Marshall is an industry-recognized virtualization and cloud computing expert, a seven-time recipient of the VMware vExpert distinction, and has been heavily involved in the industry for the past 16 years.  To help solve industry challenges, he co-founded and helped launch several successful virtualization software companies such as ProTier, Surgient, Hyper9 and Vertiscale. He also spent a few years transforming desktop virtualization while at Virtual Bridges.

David is also a co-author of two very popular server virtualization books, "Advanced Server Virtualization: VMware and Microsoft Platforms in the Virtual Data Center" and "VMware ESX Essentials in the Virtual Data Center," and the Technical Editor on Wiley’s "Virtualization for Dummies" and "VMware VI3 for Dummies" books.  David has also authored countless articles for a number of well-known technical magazines, including InfoWorld, Virtual-Strategy and TechTarget.  In 2004, he founded the oldest independent virtualization and cloud computing news site, VMblog.com, which he still operates today.

Follow David Marshall

Twitter: @vmblog
LinkedIn: https://www.linkedin.com/in/davidmarshall
Blog: http://vmblog.com