Final Thoughts on the Five-Day AWS Outage

April 26, 2011 By David
Object Storage
Grazed from eWEEK. Author: Chris Preimesberger

Five full days after its largest outage hit on the morning of April 21, Amazon Web Services said it has finally restored virtually all services to its customers.

However, there are still a lot of smoldering IT managers who haven’t yet cooled off completely from the outage, which started at 1:41 a.m. PDT April 21 at the AWS data center in Northern Virginia.

The mishap disrupted AWS’s EC2 (Elastic Compute Cloud) hosting service, knocking thousands of Websites, including such popular ones as Foursquare, Reddit, Quora and Hootsuite, off the Internet. A limited number of customers were still reporting data "stuck" in the EBS (Elastic Block Store) service on April 25.

Income that AWS-hosted businesses lost during that one- to five-day window will never be regained. This was a serious business problem for hundreds, perhaps thousands, of IT managers, who are now wondering whether to continue using the service.

"EBS is now operating normally for all APIs and recovered EBS volumes," Amazon reported April 25 on its status dashboard. "The vast majority of affected volumes have now been recovered. We’re in the process of contacting a limited number of customers who have EBS volumes that have not yet recovered and will continue to work hard on restoring these remaining volumes." The company said it will post a detailed incident report.

What are industry people saying in the wake of the mishap? What might be the long- and short-term results of an outage that shackled one of the sturdiest, most trusted Web services providers in the world?

Reaction from Far and Wide

Several AWS users commented with frustration on eWEEK stories covering the mishap. The blogosphere, as one might imagine, was rife with commentary.

"In short, if your systems failed in the Amazon cloud this week, it wasn’t Amazon’s fault," blogged O’Reilly Media’s George Reese. "You either deemed an outage of this nature an acceptable risk or you failed to design for Amazon’s cloud computing model. The strength of cloud computing is that it puts control over application availability in the hands of the application developer and not in the hands of your IT staff, data center limitations, or a managed services provider.

"The AWS outage highlighted the fact that, in the cloud, you control your SLA in the cloud—not AWS."

Morphlabs was one of the first AWS solution providers when it launched Morph Appspace in 2007 and now has more than 4,000 users.

"The Amazon EC2 outage has sent ripples and shockwaves through the AP wires and blogosphere, but those of us who have been in the cloud computing trenches for the equivalent of tech eons (at Morphlabs, we’ve been at it for more than four years), the news is neither shocking nor a reason to stray from our mission," founder and CEO Winston Damarillo told eWEEK.

"While it is tempting to unleash common fears about new technologies when confronted with ‘proof’ of their failings and risks, our years of innovation and adoption tell us that there is a wiser path. Approach with caution, but approach nonetheless. The same is true for the implementation of cloud computing services in your IT organization."

Morphlabs’ approach to software development assumes failure, and it builds fault tolerance into all of its cloud computing solutions, Damarillo said.
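
As a rough illustration of what "assuming failure" can look like in code, the sketch below tries the same request against replicas in several zones and returns the first one that succeeds. It is a generic example, not Morphlabs' or Amazon's actual tooling; the zone names and the `failing` and `healthy` functions are hypothetical stand-ins for real service calls.

```python
from typing import Callable, Dict


def call_with_failover(replicas: Dict[str, Callable[[], str]]) -> str:
    """Try each zone's replica in order and return the first successful result."""
    errors = {}
    for zone, fetch in replicas.items():
        try:
            return fetch()
        except Exception as exc:  # a real system would catch narrower errors
            errors[zone] = exc
    # Every replica failed, so surface all the errors to the caller.
    raise RuntimeError(f"all replicas failed: {errors}")


def failing() -> str:
    # Hypothetical replica in an impaired zone.
    raise ConnectionError("zone unavailable")


def healthy() -> str:
    # Hypothetical replica in an unaffected zone.
    return "ok"


if __name__ == "__main__":
    print(call_with_failover({"zone-a": failing, "zone-b": healthy}))
```

The point of the pattern is that the application, not the provider, decides what happens when one zone stops responding.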

Ed Laczynski, vice president of cloud strategy and architecture at Datapipe, a New Jersey-based provider of managed IT and hosting services that uses AWS for one of its offerings, told eWEEK that the AWS story "shows how important it is to think about engineering when you’re designing systems for the cloud."

"Those [enterprises] that hadn’t designed their cloud in Amazon for high availability suffered  in the regional zones that were affected," Laczynski said. "A lot of the hype around cloud is that it’s super easy, you just spin up servers, it just works, I don’t have to worry about anything, etc., that was broken, for sure.

"If you look at the documentation, best practices and so on of the people doing it [cloud] best, they’re all designing for failure [to happen]. For us, it was an opportunity to test that concept. Our customers that are deployed on AWS suffered only minimal disruption, if any at all, because we designed for it."

Lydia Leong of Gartner Research wrote in an advisory that Amazon EC2 didn’t actually violate its service-level agreement when the outage occurred.

"Amazon’s SLA for EC2 is 99.95 percent for multi-AZ deployments," Leong wrote. "That means that you should expect that you can have about 4.5 hours of total region downtime each year without Amazon violating its SLA.

"Note, by the way, that this outage does not actually violate their SLA. Their SLA defines unavailability as a lack of external connectivity to EC2 instances, coupled with the inability to provision working instances. In this case, EC2 was just fine by that definition. It was Elastic Block Store [EBS] and Relational Database Service [RDS] which weren’t, and neither of those services have SLAs."