When You Move To The Cloud, Plan For The Storms

August 31, 2012 | By David

Grazed from Forbes. Authors: Ashutosh Garg and Joshua Levy.

Cloud computing infrastructure plays a rapidly increasing role in business-critical online operations. A recent study of several million Internet users found that one third of them visited a website that uses Amazon Web Services (AWS) infrastructure each day. And for good reason. E-commerce websites, social and advertising networks, streaming video, and other so-called “Big Data” applications all benefit greatly from the ability to spin up large compute clusters easily and on demand. You can leave the work of setting up racks and servers to the cloud providers, pay only when you need it, and instead focus engineering resources on core competencies.

However, are there risks in your business being dependent on a single cloud provider? Recent failures at Amazon, such as a major thunderstorm-induced outage in June at Amazon’s northern Virginia data centers, have left swaths of companies that depend on the cloud — including Netflix, Instagram, Pinterest, and others — with hours of downtime. Last year also saw a more serious multi-day outage for some Amazon customers. As the cloud becomes central to operations, companies must contemplate the consequences of delegating critical data center infrastructure to services that might just go away. Can an unexpected storm blow the cloud — and your business — offline?

The space of cloud services is quite different from the landscape of traditional hosting and roll-your-own data centers. Although the underlying hardware is largely similar to what’s in traditional environments, cloud resources are packaged in a variety of easier-to-use forms.

The business risks of using these services depend on two key requirements: durability (protecting your critical data from loss) and availability (keeping your systems up and running). It is essential to evaluate the roles and failure modes of cloud services in your application, and how they affect durability and availability. The essential strategy to mitigate risks of data loss and downtime is to add redundancy and reduce dependencies wherever possible.

In fact, though they are sometimes overlooked in the outcry following outages, techniques to achieve high levels of reliability do apply to the cloud. Amazon Web Services even offers tools to help, including regions, availability zones, and multiple types of storage services. For example, the critical cloud-based services of many companies (including our own) had no downtime during the incident in June, mainly because of the use of multiple, redundant availability zones. Similarly, with less critical services, you should be able to reprovision from backup storage after a short but acceptable outage, even when the primary storage has been lost.

Does the shift of businesses to the cloud mean less reliability? Bottom line — we don’t think so. But there are key differences. First, as more companies come to rely on the same cloud providers, failures can have increasingly broad impact on the Web, across many sites and services. Second, when using cloud services, you have much less internal technical visibility and control. AWS and other cloud hosting providers generally cannot reveal detailed internals of their services, for security and business reasons. Effectively, this means that you can’t be sure when or how individual failures will be corrected when things go wrong. Instead, you need to plan for alternatives should these components or services fail, using both redundancy within the main cloud provider and, when possible, fallbacks to alternate service providers.

If your business does use AWS, here are a few technical best practices for riding out the next storm:

Design for failures: For high availability, always use multiple availability zones. In our experience, a large proportion of companies’ AWS outages result from using a single availability zone. Sometimes, this can even be done at no extra cost (for example, by placing a database master and slave in two different zones).
Back up your data: Historically and by design, Amazon’s S3 storage service has a very good record of durability. Other storage layers are less durable, so they should always be backed up with snapshots, which are persisted to S3.
Monitor health: Internal and third-party monitoring services should continuously validate that your systems are working, so you can respond quickly.
Have a disaster plan: While availability zones do provide additional availability, consider fallback plans for a full-region meltdown. This means instant or relatively rapid failover of your services to new deployments in another AWS region or hosted with another service provider. Typically this is best handled by a global DNS provider with failover capability, like UltraDNS, Dynect, or DNS Made Easy.
Keep your options open: For longer-term stability of your cloud infrastructure, avoid deep lock-in to any particular service offered by one provider. For example, consider mechanisms or third-party services that allow you to deploy servers in AWS as well as other cloud service providers, like Rackspace, SoftLayer, or GoGrid.
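To make the first point concrete: the single-zone pitfall can be avoided with a simple placement rule. The sketch below is not an AWS API call — the zone names are hypothetical examples, and in practice the resulting assignment would be fed into whatever provisioning tooling you use. It merely shows the round-robin idea under those assumptions:

```python
from itertools import cycle

def spread_across_zones(instance_ids, zones):
    """Assign instances to availability zones round-robin, so that losing
    any single zone takes out only a fraction of the fleet."""
    if not zones:
        raise ValueError("need at least one availability zone")
    zone_cycle = cycle(sorted(zones))
    return {instance_id: next(zone_cycle) for instance_id in instance_ids}
```

With two zones, a four-instance fleet ends up evenly split, so a zone outage like June’s leaves half the capacity running.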
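The backup advice pairs naturally with a retention policy, so snapshots persisted to S3 do not accumulate without bound. A minimal sketch of the pruning logic, independent of any storage API — the timestamps here are plain strings, and actual deletion would go through your provider’s snapshot tooling:

```python
def snapshots_to_prune(snapshot_dates, keep=7):
    """Return the snapshots falling outside the `keep` most recent ones.

    `snapshot_dates` is any iterable of sortable timestamps (e.g.
    ISO-8601 strings or datetime objects); the newest `keep` entries
    are retained, and everything older is returned for deletion.
    """
    if keep < 1:
        raise ValueError("must keep at least one snapshot")
    return sorted(snapshot_dates, reverse=True)[keep:]
```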
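The monitoring and disaster-plan points boil down to one decision rule: serve from the primary while its health checks pass, and fail over otherwise. A minimal sketch of that rule under assumed names (the endpoints are made up; a real deployment would delegate this to a DNS provider with failover capability, as suggested above):

```python
def choose_endpoint(health, primary, fallbacks):
    """Pick the first healthy endpoint, preferring the primary.

    `health` maps endpoint name -> bool, the result of the most
    recent health check; endpoints never checked count as down.
    """
    for endpoint in [primary] + list(fallbacks):
        if health.get(endpoint, False):
            return endpoint
    raise RuntimeError("no healthy endpoint available")
```

Continuous monitoring keeps the `health` map current; the failover decision itself then stays trivial and easy to test.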