Minimizing public cloud disruptions

November 2, 2011, by David
Grazed from TechTarget.  Author: Geva Perry.

Cloud computing depends on connectivity and availability, but both can be far from perfect with some providers. A public cloud disruption can impair the productivity of an entire organization unless data center administrators know what to do when one happens, and one will. In this tip, you will learn how to choose a public cloud provider and how best to respond to outages…

Learning to accept cloud crashes
Cloud disruptions are a fact of life, even among the most widely used providers. Amazon Web Services had two major outages in 2011: one in April in its Virginia data center and one in August in its Ireland data center, which was literally hit by lightning. Both crashes caused major disruptions to hundreds of Amazon cloud customers. What followed was an inevitable debate on the viability of using a public cloud for mission-critical applications.

Although major disruptions like these get the attention of the media and the blogosphere, there is a bigger, ongoing issue with the reliability of public clouds such as Amazon EC2: they are not designed to be reliable in the first place. Instances on Amazon’s cloud crash all the time.

Both catastrophic crashes and ongoing instance failures underscore that ensuring high availability in the cloud is more complex, or perhaps simply different, than ensuring high availability in your on-premises data center. It requires multiple strategies and trade-off decisions.

Public cloud failures at different levels
It’s important to understand that failures in the cloud can occur at different levels, according to George Reese, chief technology officer at enStratus Networks Inc., which offers infrastructure management services to major cloud providers. Reese describes those different levels as “the five levels of redundancy”:

  • Physical machine level
  • Virtual machine level
  • Availability zone level
  • Regional/data center level
  • Cloud provider level

Not all public cloud providers offer all of these levels. For example, very few—if any—cloud providers besides Amazon offer availability zones, which are data centers within a single geographical location that are insulated from each other. The idea behind availability zones is that if one data center fails, it won’t drag the others down.

But even though Amazon offers availability zones, it does not provide visibility into or control of the physical machine level. The rule of thumb is that the further down this list you go in building a redundant, highly available application, the more reliable the application becomes. But it also becomes more complex and expensive. Let’s review that trade-off in more detail:

Incorporating redundancy
Redundancy at the physical machine level is a familiar concept and a common practice in many traditional data centers. Because of that, it is the least complex and the least expensive option. Many public cloud services offer this type of redundancy automatically and as part of the standard price of the service. For example, Amazon’s Elastic Block Store service automatically replicates all data to a separate physical machine.

Similarly, at the virtual machine level, there are many known practices—and available commercial and open source products—for maintaining high availability through load balancing, replication and fail-over. This is true both for traditional data centers and public clouds.
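
As a rough illustration of the fail-over practice mentioned above, here is a minimal sketch, assuming a hypothetical pool of virtual machines that each carry a simple health flag. `Replica` and `RoundRobinBalancer` are made-up names for this example and do not correspond to any particular commercial or open source product.

```python
from dataclasses import dataclass
from itertools import cycle
from typing import Iterator, List


@dataclass
class Replica:
    """Hypothetical stand-in for one virtual machine running the app."""
    name: str
    healthy: bool = True


class RoundRobinBalancer:
    """Rotates across replicas and fails over past unhealthy ones."""

    def __init__(self, replicas: List[Replica]) -> None:
        if not replicas:
            raise ValueError("at least one replica is required")
        self._replicas = replicas
        self._ring: Iterator[Replica] = cycle(replicas)

    def pick(self) -> Replica:
        # Try each replica at most once per call before giving up.
        for _ in range(len(self._replicas)):
            candidate = next(self._ring)
            if candidate.healthy:
                return candidate
        raise RuntimeError("no healthy replica available")
```

In a real deployment the health flag would be set by periodic health checks rather than by hand, and the balancer itself would need to be redundant.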

Designing for high availability at the physical machine level and at the virtual machine level is familiar to Web application developers, who now have established best practices for building these types of distributed apps. The same practices would be a challenge, however, for existing legacy applications that were designed with a more monolithic, client-server architecture. For this reason, these types of applications are better suited to solid Infrastructure as a Service clouds, such as Verizon’s Terremark Worldwide Inc., Bluelock or Virtacore Systems Inc. They are not suitable for liquid clouds such as AWS or GoGrid.

All that said, there is still some complexity involved in designing applications for high availability in volatile environments such as Amazon EC2. It puts a big burden on developers.

This is where Platform as a Service vendors such as Salesforce.com’s Heroku Inc. come in. They run on top of Infrastructure as a Service (IaaS) providers such as Amazon and promise to handle many of the complexities of running applications in dynamic environments.

Two kinds of IaaS public clouds
Two basic models of IaaS public clouds—called liquid clouds and solid clouds—are emerging, each with an almost diametrically opposed philosophy behind it.

Amazon EC2 is the epitome of the liquid cloud. It was built with the philosophy of unreliable hardware and reliable software. In other words, expect the infrastructure to fail, and fail often, and design your applications to deal with it.

Solid clouds follow the philosophy of reliable hardware and unreliable software. They are offered by companies such as Bluelock and Verizon’s Terremark Worldwide Inc. They use expensive proprietary hardware and are suitable for legacy applications that were not designed for frequent hardware failures.

|                                     | Liquid                                    | Solid                            |
|-------------------------------------|-------------------------------------------|----------------------------------|
| Alternative names                   | commodity, webscale, “design for failure” | enterprise, legacy, traditional  |
| Hardware                            | cheap, unreliable, commodity              | expensive, reliable, proprietary |
| Isolation                           | public, shared                            | private, dedicated               |
| Provisioning                        | minutes, self-service, API                | hours/days, professional service |
| Automation                          | high                                      | low                              |
| CloudOS, platform                   | open source                               | typically VMware                 |
| Customer acquisition and onboarding | low-touch                                 | high-touch                       |
| Environment                         | homogenous                                | heterogeneous                    |
| Price                               | $                                         | $$$                              |
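
The liquid-cloud philosophy of expecting the infrastructure to fail can be illustrated with a small sketch. It is not taken from Amazon's tooling: `call_with_retries`, its parameters and the choice of `ConnectionError` as the transient failure are all assumptions made for this example. The idea is simply that the application retries a failed call with a growing delay instead of assuming the instance behind it is always there.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    operation: Callable[[], T],
    attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `operation`, retrying with exponential backoff on ConnectionError.

    Hypothetical helper: the names and defaults are illustrative only.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of retries: surface the failure to the caller
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")
```

The `sleep` parameter is injectable so the delay logic can be exercised without actually waiting. A solid-cloud legacy application typically has no equivalent of this wrapper, which is part of why it fares poorly on unreliable hardware.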

High availability across data centers
At the availability zone level, things get a little more complicated but are still within the realm of expertise of many developers. This is because Amazon, possibly the only public cloud provider that offers the availability zone concept, provides various tools for maintaining high availability across availability zones.
One such tool is Amazon EC2 Elastic IP addresses, which mask specific instances and availability zones behind a stable address. They can be programmatically remapped to instances in other availability zones in case of failure.
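
The remapping idea can be sketched with a toy in-memory model. This does not call the real EC2 API: `Instance`, `AddressTable` and `fail_over` are hypothetical names, and the address and zone labels are placeholders. The sketch only shows the logic of moving one stable public address from a failed instance to a healthy standby in a different zone.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Instance:
    """Hypothetical record for one running instance."""
    instance_id: str
    zone: str
    healthy: bool = True


class AddressTable:
    """Toy model: a stable public address mapped to the instance serving it."""

    def __init__(self) -> None:
        self._bindings: Dict[str, Instance] = {}

    def associate(self, address: str, instance: Instance) -> None:
        self._bindings[address] = instance

    def target(self, address: str) -> Instance:
        return self._bindings[address]


def fail_over(
    table: AddressTable, address: str, standbys: List[Instance]
) -> Optional[Instance]:
    """Remap `address` to a healthy standby in another zone, if needed."""
    current = table.target(address)
    if current.healthy:
        return None  # the current target is fine, nothing to remap
    for candidate in standbys:
        if candidate.healthy and candidate.zone != current.zone:
            table.associate(address, candidate)
            return candidate
    raise RuntimeError("no healthy standby outside the failed zone")
```

Clients keep using the same address throughout, which is the point of the masking described above. The standby instances still have to exist and be kept up to date, so the remapping itself is the easy part.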

That said, data center admins still need to maintain multiple copies of various components of the app, which adds to costs. And although Amazon has said there is no single point of failure among availability zones within a single region, that claim was disproven by both the April and August outages.

Moving down the levels to regions and cloud providers, things get significantly more complicated and expensive. For one thing, connections among regions and providers go over the public Internet, which is much less reliable and has higher latency.

So the logic that addresses the move between these data centers will need to be much more sophisticated and will have to address a number of scenarios, especially to prevent data and application state inconsistencies. And although Amazon does not charge for data transfers between availability zones, such communication across regions and outside cloud providers may become costly.

It can become even more complex when attempting distributed architectures across cloud providers—for example, Amazon and Rackspace—because each uses different APIs and other management approaches.
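
One common way to cope with differing provider APIs is a thin adapter layer, sketched below. The two "clouds" here are invented for the example (`FakeCloudA`, `FakeCloudB`) and deliberately do not reproduce the Amazon or Rackspace APIs. They only stand in for two providers whose native calls take different arguments and return different shapes.

```python
from abc import ABC, abstractmethod
from typing import Dict


class FakeCloudA:
    """Hypothetical provider whose native call returns a dict."""

    def run_instance(self, image_id: str) -> Dict[str, str]:
        return {"instanceId": f"a-{image_id}"}


class FakeCloudB:
    """Hypothetical provider whose native call returns a bare string."""

    def create_server(self, image: str, flavor: str) -> str:
        return f"b-{image}-{flavor}"


class Launcher(ABC):
    """Common interface the application codes against."""

    @abstractmethod
    def launch(self, image: str) -> str:
        """Start one server and return its identifier."""


class CloudALauncher(Launcher):
    def __init__(self, client: FakeCloudA) -> None:
        self._client = client

    def launch(self, image: str) -> str:
        return self._client.run_instance(image_id=image)["instanceId"]


class CloudBLauncher(Launcher):
    def __init__(self, client: FakeCloudB, flavor: str = "small") -> None:
        self._client = client
        self._flavor = flavor

    def launch(self, image: str) -> str:
        return self._client.create_server(image, self._flavor)


def launch_everywhere(launchers: Dict[str, Launcher], image: str) -> Dict[str, str]:
    """Start the same image on every provider and collect the server ids."""
    return {name: launcher.launch(image) for name, launcher in launchers.items()}
```

The adapter hides only the call signatures. Differences in networking, storage and management approach still have to be handled separately.
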
Third-party vendors in the fast-paced cloud computing business are coming up with solutions to these issues. Cloudsoft Corp., RightScale Inc. and enStratus are examples of companies that offer various multicloud solutions for application mobility and disaster recovery. Xeround is another company tackling the particularly sticky problem of relational database cross-cloud data mobility and disaster recovery.

In the end, each organization will need to decide on the level of high availability each of its applications requires. It then needs to make a trade-off decision, for example, on how much it is willing to invest to prevent the relatively rare occasions of availability zone, regional and public cloud provider failures.
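
One way to frame that trade-off is a back-of-the-envelope calculation, sketched below. Everything in it is an assumption for illustration: the function names, the idea of a flat yearly cost per redundant copy, and above all the premise that copies fail independently. The availability zone outages discussed earlier show that failures can be correlated, so the result should be read as an optimistic bound rather than a prediction.

```python
HOURS_PER_YEAR = 24 * 365


def combined_availability(single: float, copies: int) -> float:
    """Availability of `copies` redundant deployments.

    ASSUMPTION: failures are independent, so the service is down only
    when every copy is down at once: 1 - (1 - single) ** copies.
    """
    if not 0.0 <= single <= 1.0:
        raise ValueError("single must be between 0 and 1")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return 1.0 - (1.0 - single) ** copies


def expected_annual_cost(
    single: float,
    copies: int,
    cost_per_copy: float,
    downtime_cost_per_hour: float,
) -> float:
    """Yearly cost of running the copies plus the expected cost of downtime."""
    downtime_hours = (1.0 - combined_availability(single, copies)) * HOURS_PER_YEAR
    return copies * cost_per_copy + downtime_hours * downtime_cost_per_hour
```

With made-up inputs of 99% availability per copy, 10,000 per copy per year and 1,000 per hour of downtime, one copy works out to 97,600 a year and two copies to 20,876. The same function with a low downtime cost can just as easily show that a second copy is not worth paying for.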