Best practices for mitigating cloud application outages
August 15, 2011
In spite of the hype that a cloud system or application will never fail, we still see cases of cloud system failures. A recent example is the lightning strikes in Dublin that took the Amazon and Microsoft clouds down for a while. While this may cause some fear, uncertainty and doubt about the cloud, the underlying fact remains that moving an application to the cloud is not just a matter of flipping a switch. It needs a lot more planning, and the best practices proven in the traditional data center are still valid. The following are some best practices for preventing cloud outages. These go beyond the basic disaster recovery provisions offered by most cloud providers…
Avoiding a Single Point of Failure Across Tenants: Most cloud applications tend to be multi-tenant. As explained in my other articles, multi-tenancy within an enterprise can mean different geographical regions, business units, or acquired and merged entities. Load balancing, web server scaling, application server routing and database partitioning should therefore be designed so that the failure of a single cloud component, such as a database or an application server, does not bring down all the tenants. The database partitioning strategy plays an important role here.
Suppose an enterprise ERP application is hosted on the cloud and is logically separated by plants or warehouses. In that case, ensure that the failure of a single virtual machine or data store shuts down only the specific plants it serves, not all of them. All the load balancing, routing and data partitioning schemes should adhere to the principle of avoiding total failure when a few virtual machines are down.
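As a rough illustration of the ERP example, the sketch below maps each plant to exactly one partition, so that a failed partition only takes down the plants assigned to it. The class names (`Partition`, `TenantRouter`) and the plant and data store names are hypothetical and are not taken from any vendor API.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Partition:
    """One isolated slice of the deployment, e.g. a VM plus its data store."""
    name: str
    healthy: bool = True


@dataclass
class TenantRouter:
    """Maps each tenant (here, a plant) to exactly one partition."""
    assignments: Dict[str, Partition] = field(default_factory=dict)

    def assign(self, tenant: str, partition: Partition) -> None:
        self.assignments[tenant] = partition

    def route(self, tenant: str) -> str:
        """Return the partition name for a tenant, or fail if it is down."""
        partition = self.assignments[tenant]
        if not partition.healthy:
            raise RuntimeError(f"{tenant} is unavailable: {partition.name} is down")
        return partition.name

    def affected_tenants(self) -> List[str]:
        """List the tenants whose partition is currently unhealthy."""
        return sorted(t for t, p in self.assignments.items() if not p.healthy)


# Hypothetical layout: two data stores serving three plants.
east = Partition("db-east")
west = Partition("db-west")
router = TenantRouter()
router.assign("plant-chicago", east)
router.assign("plant-boston", east)
router.assign("plant-seattle", west)

east.healthy = False                  # simulate the loss of one data store
print(router.affected_tenants())      # only the plants on db-east
print(router.route("plant-seattle"))  # plants on db-west keep working
```

The point of the design is that the blast radius of a failure follows the partition map, so how plants are grouped onto partitions is a business decision as much as a technical one.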
Utilizing the Out-of-the-Box Features of the Vendor for Availability: Most cloud providers give you multiple choices to weather disaster and outage scenarios. It is up to the enterprises to evaluate them and choose the ones best suited to their needs. Some of the typical options given by various vendors are:
- Multiple data centers across regions: Most providers have data centers on several continents or in major locations across the world. Scaling the application out across these locations ensures that the failure of a single location does not result in a total outage of your application.
- Availability Zones: Though this is Amazon EC2 terminology, the concept is about isolating certain servers and networks from failures in other parts of the same geographical region. After careful analysis of this feature, scaling out the application and data across availability zones is a viable option.
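The idea of scaling out across zones can be sketched as a simple round-robin placement. This is a minimal illustration, not a provider API: the zone names, instance names and both helper functions are assumptions made up for the example.

```python
from collections import Counter
from itertools import cycle
from typing import Dict, List


def spread_across_zones(instances: List[str], zones: List[str]) -> Dict[str, str]:
    """Assign instances to zones round-robin so the load is spread evenly."""
    zone_cycle = cycle(zones)
    return {instance: next(zone_cycle) for instance in instances}


def surviving_instances(placement: Dict[str, str], failed_zone: str) -> List[str]:
    """Return the instances that are still running after one zone fails."""
    return [inst for inst, zone in placement.items() if zone != failed_zone]


# Hypothetical zone and instance names.
zones = ["zone-a", "zone-b", "zone-c"]
instances = [f"web-{n}" for n in range(1, 7)]

placement = spread_across_zones(instances, zones)
print(Counter(placement.values()))               # two instances per zone
print(surviving_instances(placement, "zone-a"))  # four of six remain
```

With six instances over three zones, losing any one zone leaves two thirds of the capacity running, which is the kind of arithmetic worth doing before choosing how many zones to pay for.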
Utilizing the Out-of-the-Box Backup Features: Most vendors provide multiple choices for backing up data automatically. However, it is up to the enterprises to choose the ones that fit their needs.
For example, when we use Windows Azure Storage, all content stored on Windows Azure is replicated three times. No matter which storage service you use, your data is replicated across different fault domains, which makes it much more fault tolerant. Similarly, SQL Azure makes automatic backups of the database.
Similarly, EBS storage volumes in Amazon automatically replicate data across multiple servers within an availability zone, and options like S3 provide backup across availability zones.
Building a Custom Storage Backup Strategy: One of the major reasons for application outages is that the applications rely fully on the vendor-provided automatic backup options. If everything else fails, application owners have no option but to wait for the vendor to restore its services.
Also, vendor (cloud provider) backup options will not protect against application-level failures such as data corruption or the accidental or deliberate deletion of data. Hence a custom, application-specific backup strategy is needed.
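To make the point about accidental deletion concrete, here is a small sketch of an application-owned snapshot and restore. It uses SQLite from the Python standard library purely as a stand-in for a cloud database. The `orders` table and the `snapshot` helper are hypothetical, and a real strategy would write snapshots to durable storage outside the primary provider rather than to memory.

```python
import sqlite3


def snapshot(source: sqlite3.Connection) -> sqlite3.Connection:
    """Copy the whole database into an independent in-memory connection."""
    copy = sqlite3.connect(":memory:")
    source.backup(copy)
    return copy


# The "live" application database (a stand-in for a cloud database).
live = sqlite3.connect(":memory:")
live.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT)")
live.executemany("INSERT INTO orders (item) VALUES (?)", [("pump",), ("valve",)])
live.commit()

# Application-controlled backup, taken before anything goes wrong.
saved = snapshot(live)

# Simulate an accidental deletion. Replication would faithfully copy this
# mistake to every replica, so replication alone would not help here.
live.execute("DELETE FROM orders")
live.commit()

# Restore the live database from the application's own snapshot.
saved.backup(live)
rows = live.execute("SELECT item FROM orders ORDER BY id").fetchall()
print(rows)
```

The distinction the sketch draws is between replication, which copies every change including the bad ones, and a snapshot, which preserves a known good state to roll back to.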
Most cloud services provide many custom options too. For example, if you use a cloud database like Oracle on Amazon RDS, you have options like the recycle bin and Flashback Database that can help restore the database content to a specific point in time.
Another simple option that has always worked effectively is to use features like triggers or message queues to replicate transactions to a different server or region. This ensures that all the important transactions are backed up and also makes restoring easier.
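A trigger-based variant of this idea might look like the following sketch. SQLite stands in for both the primary database and the copy in another region. The `payments` and `change_log` tables, the `log_payment` trigger and the `ship_changes` function are all hypothetical names chosen for the example, and in practice the shipping step would typically run over a message queue or a scheduled job.

```python
import sqlite3

primary = sqlite3.connect(":memory:")
replica = sqlite3.connect(":memory:")  # stands in for a server in another region

# The trigger records every new payment in a change log on the primary.
primary.executescript("""
CREATE TABLE payments (id INTEGER PRIMARY KEY, amount REAL NOT NULL);
CREATE TABLE change_log (
    seq        INTEGER PRIMARY KEY AUTOINCREMENT,
    payment_id INTEGER NOT NULL,
    amount     REAL NOT NULL
);
CREATE TRIGGER log_payment AFTER INSERT ON payments
BEGIN
    INSERT INTO change_log (payment_id, amount) VALUES (NEW.id, NEW.amount);
END;
""")
replica.execute("CREATE TABLE payments (id INTEGER PRIMARY KEY, amount REAL NOT NULL)")


def ship_changes(src: sqlite3.Connection, dst: sqlite3.Connection, last_seq: int) -> int:
    """Apply change-log entries newer than last_seq to the replica.

    Returns the highest sequence number applied, so the next run can resume.
    """
    rows = src.execute(
        "SELECT seq, payment_id, amount FROM change_log WHERE seq > ? ORDER BY seq",
        (last_seq,),
    ).fetchall()
    for seq, payment_id, amount in rows:
        # INSERT OR REPLACE keeps the step safe to repeat after a partial failure.
        dst.execute(
            "INSERT OR REPLACE INTO payments (id, amount) VALUES (?, ?)",
            (payment_id, amount),
        )
        last_seq = seq
    dst.commit()
    return last_seq


primary.execute("INSERT INTO payments (amount) VALUES (120.0)")
primary.execute("INSERT INTO payments (amount) VALUES (75.5)")
primary.commit()

last = ship_changes(primary, replica, 0)
replicated = replica.execute("SELECT id, amount FROM payments ORDER BY id").fetchall()
print(last, replicated)
```

Tracking the last applied sequence number lets the shipping step resume after an interruption without losing or duplicating transactions.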
Creating a Copy Back in the Data Center: No current enterprise is going to fully relinquish its data centers and run the business entirely on the cloud. Rather, there will be a hybrid delivery model combining data center, private cloud and public cloud. In that scenario, keeping a local copy of the most critical data is always a good option. Most cloud providers support such a scenario too.
For example, with SQL Azure Data Sync we can replicate data from the cloud back to the data center.
SQL Azure Data Sync Scenarios:
- Cloud to cloud synchronization
- Enterprise (on-premise) to cloud
- Cloud to on-premise
- Bi-directional, sync-to-hub or sync-from-hub synchronization
Summary: The cloud has far-reaching potential to let enterprises concentrate on business capability needs rather than operational and maintenance needs. The cloud also opens up new areas like High Performance Computing and Platform and Solutions as a Service. A few initial outages should not create fear, uncertainty and doubt in the minds of enterprises.
It is all about the SLA needs of the individual applications and how we plan the cloud deployment. For example, it is almost impossible for today's enterprises to suddenly provision a data center on a different continent and use it for disaster recovery. However, most cloud providers allow for such a scenario through simple self-service provisioning.
It is up to the enterprises to evaluate the out-of-the-box and custom features against their SLA needs and come up with an appropriate strategy. This will make the enterprises' cloud journey more fruitful.


