Top five cloud outages of 2011

December 20, 2011 By David
Grazed from Cloud Pro. Author: Jennifer Scott.

The public cloud has many benefits. The instant access to extra storage and compute helps in times of need. The pay-as-you-go model means you only splash out for what you use. The lack of a license leaves you free to move your data with ease.

But the one thing we all worry about is an outage. It doesn’t matter how good the service is if you cannot access it, and the fear of downtime still prevents numerous companies from putting mission-critical data or applications out there.

We take a look at the top five cloud outages this year and keep our fingers crossed the offenders learn from their mistakes…

In at number five…

Google Docs September outage 

We at Cloud Pro love the convenience of Google Docs. We often use it when we are out on the road, covering cloud computing stories from conferences and events, or typing up an interview we can then easily share with those back in the office.

However, on 8 September, things came to a halt, infuriating numerous customers of the internet giant.

The word processor faltered for only an hour and, because it fell over after 10pm in the UK, we were largely unaffected. Yet it was smack bang in the middle of the US working day and hit a lot of companies that needed access to their files.

The major fury, though, was not down to the outage itself but the lack of explanation. Even when Cloud Pro spoke to Google, the firm didn’t offer any information, only an apology.

Lesson to learn: Keep your customers informed and explain any outages. It will avoid a lot of anger later on.

Rolling up in fourth place…

Google Docs October outage 

If one wasn’t enough, Google yet again wound up its customers, just a month after the last outage at Google HQ.

The word processor fell down again, specifically in San Francisco and Budapest, meaning the issue affected fewer users but in more specific regions.

Although, again, it didn’t take long for the company to fix the problems, Google still didn’t explain what had caused the outage.

It merely said to Cloud Pro: “Please rest assured that system reliability is a top priority at Google and we are making continuous improvements to make our systems better.”

Lesson to learn: One mistake can be forgiven but make it twice and the mud might stick.

 

And third place goes to…

Amazon’s East Coast outage

Yes, it is not just Google that suffers from cloud outages. The other big name in cloud has its issues too.

Back in April, Amazon’s Elastic Compute Cloud (EC2) went down on the East Coast of the US. As only a specific region was affected, you might think it doesn’t deserve third place in our hall of shame, but it is the companies it brought down with it that made the outage notable.

A lot of famous companies were running on this EC2 region at the time, including Reddit, Hootsuite, Quora and Foursquare, meaning the effects of the outage were felt much further afield than just a section of the US.

But a further 170 smaller players were taken out as well, making their businesses pretty damn difficult to operate that day.

The site was down for more than eight hours, with regular updates from Amazon explaining the issue – a network error followed by an overload of re-mirroring to the East Coast site – but it still left numerous companies flailing around with no way to fix their IT problem.

Lesson to learn: Be careful not to take down your well-known customers. They might be a great case study when things go well, but they sure make a lot of noise when things go badly.

Pipped to the post in second place…

Office 365’s August/September outages

Microsoft made a rather large song and dance about the launch of its cloud productivity suite this year. With large-scale events, huge ad campaigns and interviews splashed all over the media, it was taking a big bet on its cloud version of Office.

However, to create such a hoopla around Office 365, just to have an outage a few months down the line, brings the heat right back on you.

This is why we have merged the two outages together here. At the end of August, the suite went down for two hours in the US alone, and just two weeks later it suffered a global outage, with DNS servers taking the blame.

The issue wasn’t just that the service had only just launched, but also the chaotic nature of keeping users informed. Twitter feed claims were corrected by a blog, which in turn was corrected by Twitter, and the back-and-forth continued for hours, whilst users were unable to access their much-needed files.

Lesson to learn: When it comes to media attention, you have to take the rough with the smooth, so have a plan for any outages (and try not to have one so soon after launch).

Top of the flops in first place is…

Amazon and Microsoft Dublin lightning strike

We couldn’t put any other outage in first place than one caused by ‘an act of God.’

Dublin has become a bit of a Mecca for US companies wanting to offer their cloud services to Europe, helping them adhere to regulations and giving customers the reassurance of being nearby.

However, the city is not known for its fine weather, and back in August a major storm caused both Amazon’s and Microsoft’s data centres to be taken out by lightning strikes.

Both companies claimed the storms had led to power failures and, in turn, server failures, leaving European customers with limited – if any – access to EC2 or BPOS for an entire weekend.

Failing on a weekend was a stroke of luck for the companies involved, but if the services had been down for two days in the middle of the working week and businesses were running any mission critical applications in the data centres, the reaction could have been a lot worse.

Lesson to learn: No matter how hard you try, outages will happen. Just remember to deal with them efficiently when they do.

But what should we learn from these outages? Should we never trust public cloud? No.

What we must remember is that all of these issues that affected the big guns of cloud computing, even the lightning, could affect our own data centres. The difference is we probably don’t have thousands of engineers, fuelled by the company’s reputation being on the line, running around to fix the issues.

Definitely have a back-up plan, but being a billion-dollar company can buy you a lot of experts…