Enterprise Clouds and Swiss Cheese

July 12, 2012 | By David
Contributed Article by Carlos Escapa, CEO of VirtualSharp Software
 
The cloud is a very complex machine with lots of moving parts. The whirring and humming of modern data centers come from fans, air conditioners and power converters, but the “moving” parts that matter are in fact solid-state, buried deep inside the cloud: the huge number of software-controlled virtual components that work together to create the computing environment that we know as the cloud.

Computing clouds have become some of the biggest and most complex machines ever made by man. For every physical component in a cloud, there are easily 10 virtualized components corresponding to all the layers where virtualization is present – including storage, networks, CPU, memory, operating systems and even applications. The entire set can easily run into the hundreds of thousands of components in an enterprise cloud, and tens of millions in a public cloud. By way of comparison, the space shuttle had 2.5 million parts [1].
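As a purely illustrative back-of-the-envelope calculation (the physical component counts and the 10-to-1 ratio below are assumptions chosen to show the order of magnitude, not measured figures), the arithmetic compounds quickly:

    # Illustrative arithmetic only: the host counts and the 10-virtual-to-1-physical
    # ratio are assumed figures, not real inventory data.
    VIRTUAL_PER_PHYSICAL = 10  # assumed virtual components per physical component

    def estimate_virtual_components(physical_components: int) -> int:
        """Estimate the virtual components layered on top of the physical ones."""
        return physical_components * VIRTUAL_PER_PHYSICAL

    # Hypothetical sizes for an enterprise cloud and a public cloud.
    for label, physical in [("enterprise cloud", 20_000), ("public cloud", 2_000_000)]:
        print(f"{label}: ~{estimate_virtual_components(physical):,} virtual components")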


Service delivery in the cloud relies on the continuous and accurate functioning of a vast number of parts. IT managers use layers of abstraction to monitor the operational condition of the cloud. Each of those layers has some autonomy and resilience built in; for instance, storage arrays have sufficient redundancy to tolerate the failure of one or more disks, and load balancers can redirect traffic during periods of heavy demand.

And this is where James Reason’s Swiss Cheese model, devised in 1990 [2], begins to apply to the cloud. Reason explained that aviation catastrophes were caused by a set of unlikely events happening in sequence, such as an airplane sensor failing, followed by a pilot error and then an unusual atmospheric disturbance. The airworthiness of the machine would not be irreversibly degraded by any individual event, but when the “holes” in the different layers of defense line up, a highly undesirable outcome can slip through.

Can this happen to computing clouds? Of course. It is happening all the time – search Google for “cloud outage” and you will see. Consider for instance the following set of events:

- A bad patch is applied to production systems.
- An error creeps into replication.
- A backup job fails.
- The people who would normally catch the problem are on annual leave.

On their own, each of those events would not trigger a service outage. Bad patches, errors in replication, backup job failures, people on annual leave – these are mundane circumstances that can be handled by automated processes or human intervention. However, when they happen in succession, the holes of the Swiss cheese get aligned and the outcome is extended outages and severe data loss.  
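A minimal numeric sketch makes the point concrete. The per-layer probabilities below are invented purely for illustration; the only assumptions are that each defensive layer occasionally has a “hole” and that the layers fail independently:

    # Swiss Cheese sketch with invented, purely illustrative probabilities.
    # Each entry is the assumed chance that a given defensive layer has a
    # "hole" during a single change window.
    layers = {
        "bad patch slips past testing": 0.05,
        "replication error goes unnoticed": 0.02,
        "backup job fails silently": 0.02,
        "responsible expert on annual leave": 0.10,
    }

    # Chance that at least one layer has a hole (common, and survivable on its own).
    p_any_hole = 1.0
    for p in layers.values():
        p_any_hole *= (1 - p)
    p_any_hole = 1 - p_any_hole

    # Chance that every hole lines up at once (rare, but this is the outage case).
    p_all_aligned = 1.0
    for p in layers.values():
        p_all_aligned *= p

    print(f"at least one hole somewhere: {p_any_hole:.1%}")
    print(f"all holes aligned at once:   {p_all_aligned:.2e}")

Individual holes turn out to be commonplace, while full alignment is rare in any single window; spread across thousands of change windows and thousands of clouds, those rare alignments still surface as the outages that make the news.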

What is important about the model is that it applies just as much to a modern IT infrastructure as it does to a spacecraft. Clouds have tens to hundreds of thousands of virtual components being constantly created, deployed and discarded. During their lifecycle, components may change weekly, daily or even hourly. An efficient cloud is one that reconfigures itself quickly based on business policy and keeps resources aligned with corporate needs. And while clouds multiply benefits, they also have a soft underbelly: they are getting so big so fast, and their internal dynamics are so complex, that the chances of a severe malfunction – such as an outage or data loss – keep increasing.
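A rough calculation shows why scale by itself raises the odds. The per-component fault probability below is an assumption chosen only to illustrate the shape of the curve, not a measured failure rate:

    # Illustrative only: assume each virtual component has a tiny, independent
    # chance of a severe fault on any given day. As the component count grows,
    # the chance that at least one such fault occurs approaches certainty.
    P_FAULT_PER_COMPONENT = 1e-6  # assumed daily fault probability per component

    def p_any_fault(num_components: int) -> float:
        """Probability that at least one component faults in a day."""
        return 1 - (1 - P_FAULT_PER_COMPONENT) ** num_components

    for n in (10_000, 100_000, 1_000_000, 10_000_000):
        print(f"{n:>10,} components -> {p_any_fault(n):.1%} daily chance of at least one fault")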

To harness the power of the cloud, enterprise IT will need to rethink both its approach to automation and its need for it. Ensuring service levels and supporting business velocity will require new IT governance models in which automation is extended in ways not even imaginable in the past. Complex processes that today require dozens to hundreds of subject matter experts (think of ethical hacking, scalability tests, DR exercises…) will be orchestrated automatically, across clouds. All this will require a new, cloud-oriented operating model and will yield a much stronger, more flexible and more reliable IT.
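As a very rough sketch of what such orchestration could look like (every step name and function below is hypothetical and stands in for real tooling; it does not describe any actual product or API):

    # Hypothetical sketch: invented step names illustrating automated, cross-cloud
    # orchestration of a disaster-recovery exercise. Not a real product or API.
    def run_dr_exercise(steps):
        """Run each recovery step in order and record whether it succeeded."""
        results = []
        for name, action in steps:
            try:
                action()  # in reality this would call cloud, replication and test APIs
                results.append((name, True))
            except Exception:
                results.append((name, False))
        return results

    # Placeholder callables standing in for real orchestration tasks.
    exercise = [
        ("replicate latest snapshots to the recovery cloud", lambda: None),
        ("boot the recovery environment", lambda: None),
        ("run application smoke tests", lambda: None),
        ("measure recovery time against the SLA", lambda: None),
    ]

    for name, ok in run_dr_exercise(exercise):
        print(f"{'PASS' if ok else 'FAIL'}: {name}")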

##

[1] The Shuttle Propulsion Systems, Jody Singer, NASA Shuttle Propulsion Office, April 2011.
[2] Human Error, James T. Reason, Cambridge University Press, October 1990.

About the Author

Carlos Escapa is co-founder and CEO of VirtualSharp Software, where he works closely with customers on next-generation Business Continuity and Disaster Recovery solutions for private and public clouds. He has extensive expertise in virtualization following a successful career at VMware, where he was a senior executive in Europe. Under his leadership, VMware’s customer base grew to 4,500 customers and VMware’s channel base reached more than 300 certified companies, including partnerships with IBM, HP, Dell and Accenture. Prior to that, Carlos was Vice President of Channels EMEA at CA Technologies and Senior Director of Marketing in Japan at Sterling Software. Carlos has a Bachelor’s degree in Computer Science from Illinois State University and a Master’s degree from Virginia Tech.