Blame game over Amazon outage misses the point

June 25, 2012 | By David

Grazed from InfoWorld. Author: Matt Prigge.

On June 14, the Amazon Web Services cloud computing platform experienced a serious outage [1] in its Virginia (U.S.-East) data center. Apparently power-related, the outage took down portions of one of the four independent availability zones that operate in that data center. As a result, many popular websites and a slew of less popular ones disappeared from the Internet for several hours.

As in previous outages of megascale cloud implementations from the likes of Amazon and Microsoft [2], this incident triggered a round of hysteria about the future of cloud computing. Surprisingly, unlike the response to last April’s AWS outage [3], many rushed to Amazon’s defense. This could be a reflection of the fact that attitudes toward the cloud and its inevitable failings are becoming more realistic, or it could simply be that this month’s outage was far less widespread. In either case, anti-public-cloud pundits and competitors alike wasted no time in using this failure to underline why the public cloud is an incredibly bad idea…

Fear, uncertainty, and doubt that helps no one
As I’ve said before, I am still relatively shocked by the wild reactions that always seem to follow these highly publicized events [9]. One blog entry written by private cloud vendor Piston Computing particularly caught my eye. In it, Piston co-founder Gretchen Curtis opined that this most recent AWS outage was proof it’s better to own than to rent [10]. Although buying may indeed be better than renting in many cases, I lament the black-and-white nature of this post, and think it’s a great example of the FUD from self-interested entities (Piston sells data center technology, whereas Amazon rents it) that always seems to trail similar events and in the end serves no one well.

I won’t go point by point through Curtis’ post because I happen to agree with much of it — at least in the very large enterprise sphere that forms the sweet spot for Piston’s implementation of OpenStack [11]. But what irks me about it — and much of the other editorial commentary — is that the AWS outage doesn’t actually back up the claims Curtis made. Her points were valid; they just had little to do with this particular failure.

All data centers — on-premise or cloud — require disaster recovery
This most recent AWS failure is akin to a very serious yet recoverable failure in a core infrastructure component of an on-premise data center — and I’ve seen that happen more times than I can count. If you operate mission-critical infrastructure that can’t tolerate downtime [12], you probably have measures in place to protect your operations from extended outages — a backup data center in another building or at another site, for example. If you haven’t invested in building that kind of redundancy [13], your organization has essentially decided that avoiding the risk of downtime isn’t worth the time and money it would take.

Exactly the same is true of the public cloud. Any modern IT system, regardless of what it is or who runs it, can and will eventually fail. That applies as much to on-premise infrastructure as it does to public cloud infrastructure. The tech we use today is simply far too complex not to fail.

What many — both proponents and detractors of public cloud offerings — seem to miss is that being in the cloud does not and will never free you from having your own disaster-recovery and high-availability measures [14] in place to defend against the failures and outages that will inevitably occur.

In an on-premise or private cloud infrastructure [15], that means deploying redundant core infrastructure hardware and maintaining a testing regimen to ensure it’s working. In the cloud, you may not be concerned with the hardware, but you need to diversify your workloads across multiple availability zones within a cloud provider or even across multiple cloud providers. Conceptually, it’s no different than what you do on-premise, although it may bear little resemblance in execution.
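To make that concrete, here is a minimal sketch, in Python with the boto3 SDK, of what spreading a workload across availability zones might look like. It is an illustration only, not anything from the article or from Amazon’s post-mortem: the region, zone names, AMI ID, and instance type are placeholders, and a real deployment would put these instances behind a load balancer with health checks.

```python
# Minimal sketch: launch identical instances in several availability zones
# so the loss of one zone does not take down the whole workload.
# Assumes boto3 and valid AWS credentials; the AMI ID, instance type, and
# zone list below are placeholders, not values from the article.
import boto3

REGION = "us-east-1"
ZONES = ["us-east-1a", "us-east-1b", "us-east-1c"]  # placeholder zone names
AMI_ID = "ami-00000000"       # placeholder image
INSTANCE_TYPE = "t2.micro"    # placeholder size

ec2 = boto3.client("ec2", region_name=REGION)

# One instance per zone; the redundancy idea is the same as running
# duplicate hardware in a second on-premise data center.
for zone in ZONES:
    ec2.run_instances(
        ImageId=AMI_ID,
        InstanceType=INSTANCE_TYPE,
        MinCount=1,
        MaxCount=1,
        Placement={"AvailabilityZone": zone},
    )
```

The same pattern extends to spreading workloads across regions or even across providers; only the tooling changes, not the principle.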

Of course, if you’re large enough to enjoy the right economies of scale, you may find that delivering that kind of high availability, coupled with the elasticity the public cloud offers, is cheaper and easier to do in an on-premise private cloud — and I believe that was the real thrust of Curtis’ blog post.

The real issues: Getting the right tool for the job, learning from experience
That decision, however, is a matter of selecting the right tool for the job. Just as no single screwdriver fits every screw in existence, the public cloud, a private cloud, traditional on-premise infrastructure, or a hybrid of the three may each turn out to be the right tool for you. The key to making a good choice is truly understanding the pros and cons of each approach and matching them to your needs — areas in which neither the breathless pro-cloud nor the staunch anti-cloud narratives can really help.

That’s not to say I don’t appreciate a vigorous post-outage debate about what went wrong in a given failure and how (or whether) it will be avoided in the future. Though some public cloud providers are less than forthcoming with real details, at least we’re aware of the general cause and what was done to fix it.

How many widespread failures of on-premise data center tech (say, bad SAN firmware that leads to catastrophic failures and long downtimes) go unreported simply because nobody has the visibility into the thousands of systems deployed to correlate the failures? That’s one luxury that public cloud operators simply don’t have — everyone gets to see their failings — and, if they’re lucky, learn from them.