Life with a large cloud: Lessons learned
March 1, 2012

Many companies are just getting started with cloud computing, but others are already well entrenched. Indeed, some organizations in this latter camp have built massive private clouds that support significant portions of their operations.
These giant, multi-petabyte clouds present unique management challenges for CIOs and business leaders, including ensuring adequate security and service levels, and positioning the new cloud offering in such a way that internal customers want to use it. And because so much about cloud services is relatively new, organizations are learning to deal with the hurdles even as they deploy these mega-clouds.
"To me, the biggest challenge in implementing private clouds is the massive culture and operating model shift from a ‘do-it-for-users’ model to a ‘user self-service’ model," says Frank Gens, senior vice president and chief analyst at research firm IDC in Framingham, Mass. "The entire IT service delivery model — from design through deployment and operation, and on through support — all needs to be overhauled."
Big-cloud adopters are dealing with these and other hurdles as they implement and use service-based computing infrastructures.
Integration with legacy systems
Enterprises aren’t moving to an all-cloud environment overnight, and the integration between private clouds and the legacy infrastructure still in place is a key issue.
BAE Systems Inc., an Arlington, Va., defense and security company, operates a multi-tenant private cloud on behalf of its government and military customers that encompasses multiple petabytes of storage; the company declined to specify exactly how large its cloud is.
The company also uses a smaller-scale private cloud internally to develop and test systems that it’s building for customers, says Jordan Becker, a BAE vice president.
The cloud infrastructure that BAE Systems is building will gradually replace the legacy data centers that BAE and its customers currently operate. Integration and migration between the older and newer computing environments is an issue BAE has had to address.
"The private cloud deployment is relatively new and has not yet displaced the existing data centers," Becker says. However, he explains, the private cloud has helped "slow the growth of the legacy data centers. As the legacy data center infrastructure approaches its natural capital refresh cycle, the infrastructure will be displaced incrementally with new cloud infrastructure. This process will take several years."
During this transition, "we need to elastically extend that legacy data center to enable the applications already running to scale across the cloud transparently," Becker says. "It should look to the user as though it’s one virtual infrastructure" from both an applications and management standpoint, he explains.
To achieve that integration, BAE Systems has created a common global namespace — a heterogeneous, enterprise-wide abstraction of all file information — for all of its image data. The data includes two-dimensional images, files with stereo sound and files that include full-motion video, along with the metadata that accompanies these images, Becker says.
"The global common namespace is unique to each particular customer group," he says. "One such customer group that shares a common namespace [is made up of] users of geospatial information across several defense and intelligence agencies."
By employing this common namespace for the private cloud, which reaches across legacy data centers and file archives originally developed for stand-alone applications, customers can seamlessly access and federate information with the peers they want to collaborate with, Becker says.
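Becker doesn't describe the implementation, but the core idea of a global namespace can be sketched as a catalog that maps a single logical path to whichever site actually holds the object, so users never need to know which legacy data center or cloud tier serves a file. All names and paths below are hypothetical:

```python
# Hypothetical sketch of a global namespace: one logical path per object,
# resolved to whichever site (legacy data center, archive, cloud) holds it.
CATALOG = {
    "/geo/imagery/2011/scene-0042.img": ("legacy-dc-east", "/vol3/img/scene-0042.img"),
    "/geo/video/2011/pass-17.mpg": ("cloud-tier", "obj://geo-archive/pass-17.mpg"),
}

def resolve(logical_path):
    """Map a namespace path to (site, physical location) transparently to the user."""
    return CATALOG[logical_path]

print(resolve("/geo/imagery/2011/scene-0042.img")[0])  # → legacy-dc-east
```

A real system would back the catalog with a replicated metadata service and per-customer-group views, but the lookup contract is the same.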
Security and service continuity
The University of Southern California (USC) in Los Angeles operates a 4-petabyte private cloud that supports the institution’s USC Digital Repository. Security is one of the biggest considerations.
The Repository provides clients with digital archives of content such as high-definition videos and high-resolution photos. Services include converting physical or electronic collections to standard digital formats for preservation and online access. The Repository also offers high-bandwidth file management capabilities to access, manage and manipulate the large digital collections.
In November 2011, USC contracted with Nirvanix Inc. to deploy more than 8 petabytes of unstructured data on a Nirvanix private cloud that the vendor is managing as a service from within USC’s central data center as well as at the vendor’s own facilities. This includes 4 petabytes in the USC data center and 4 petabytes at an out-of-state location to mirror the data.
"Nirvanix gives us a full managed cloud in both places, so I don’t have to have staff familiar with their architecture or systems," says Sam Gustman, executive director of the Digital Repository and CTO of USC’s Shoah Foundation Institute. "They are responsible for all upgrades and maintenance. Even though I don’t have to operate the storage, I get the benefit of having the storage on our local network for access."
The cloud, which has the capacity to grow to 40 petabytes of storage, encompasses digital content from multiple USC entities.
USC is also leveraging the cloud for its own internal data storage needs as well as making it available to internal clients, Gustman says. He says the university opted for a cloud approach because it gives USC a geographically diverse and cost-effective way to store, preserve and distribute content on a global scale.
Protecting data from breaches was one of the major factors USC considered when it selected a storage vendor, Gustman says, and the university ensured that Nirvanix had in place the security technology and policies that met the institution’s standards.
Nirvanix has a set of policies it uses for data in the cloud, which includes encrypting data in transit and at rest, Gustman says. "It only decrypts when leaving the cloud," he says. "They also let our security team manage the keys to the data. Basically, every one of our 800,000 video files has a computer-generated password that we get to manage."
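Nirvanix's key-management internals aren't public, but the per-file-key pattern Gustman describes — a generated key per file, held by the customer rather than the provider — can be sketched as follows. The `KeyVault` class and the hash-based stream cipher are toy illustrations, not production cryptography:

```python
import hashlib
import secrets

class KeyVault:
    """Customer-managed key store: one generated key per file (toy sketch)."""
    def __init__(self):
        self._keys = {}

    def key_for(self, file_id: str) -> bytes:
        # Generate a key on first use; the customer, not the provider, holds it.
        if file_id not in self._keys:
            self._keys[file_id] = secrets.token_bytes(32)
        return self._keys[file_id]

def _keystream(key: bytes, length: int) -> bytes:
    # Derive a keystream from the key (toy construction, not real crypto).
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:length]

def encrypt(key: bytes, plaintext: bytes) -> bytes:
    return bytes(p ^ k for p, k in zip(plaintext, _keystream(key, len(plaintext))))

decrypt = encrypt  # XOR stream cipher: the same operation reverses itself

vault = KeyVault()
k = vault.key_for("video-000123")
ct = encrypt(k, b"frame data")
assert ct != b"frame data"              # stored encrypted at rest
assert decrypt(k, ct) == b"frame data"  # decrypted only when leaving the cloud
```

The point of the pattern is separation of duties: the provider stores only ciphertext, while the customer's security team controls the keys.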
Matching security policies
"The hardest part of this [cloud endeavor] is security policy; making sure that the service company matches our own security policies," Gustman adds.
In addition to strong security, USC wanted to ensure it had an effective business continuity and disaster recovery strategy in place.
This "geo diversity" — having data stored in multiple locations — ensures that the university can continue to provide services from the cloud even if one site experiences down time, Gustman says.
USC keeps two copies of the data stored in its onsite database: one on a Nirvanix disk that’s managed at USC and another, in a different state, on the Nirvanix cloud. "Not only can we track that the bits are staying exactly as they should through the multiple copies, but we can ensure that we will have copies under almost any circumstance," Gustman says.
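Verifying that "the bits are staying exactly as they should" across copies boils down to comparing content checksums between the master and each replica; a minimal sketch:

```python
import hashlib

def fingerprint(data: bytes) -> str:
    # Content hash used to verify a replica is bit-identical to the master copy.
    return hashlib.sha256(data).hexdigest()

def verify_replicas(master: bytes, replicas: dict) -> list:
    """Return the names of replicas whose bits no longer match the master."""
    expected = fingerprint(master)
    return [name for name, data in replicas.items() if fingerprint(data) != expected]

master = b"archival video segment"
replicas = {
    "usc-onsite": b"archival video segment",
    "out-of-state": b"archival video segmenX",  # simulated bit rot
}
print(verify_replicas(master, replicas))  # → ['out-of-state']
```

Run periodically against both sites, a check like this catches silent corruption so a damaged copy can be re-replicated from a good one.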
Data security is also a major issue for BAE Systems, especially considering that many of its clients are involved in military or intelligence-gathering operations. "You have systems that source data at multiple levels of security classification that have to be certified and accredited by each agency," Becker says. "Data assets must be replicated and transferred with the proper classification levels."
In addition, BAE Systems’ government customers must comply with a number of federal regulations that govern how certain types of information should be protected, who should have access to data and so forth.
Security policy is established by the government customers and then enforced by systems, such as firewalls and XML gateways, that interpret and implement those policies based on a set of complex rules, Becker says. There are secure firewalls and gateways that manage multiple levels of security on BAE’s managed systems and on the secure government networks that these systems interconnect over, Becker explains.
Managing scalability and speed
Yahoo, in Sunnyvale, Calif., has built a huge private cloud that encompasses many thousands of servers around the world, more than 200 petabytes of data and some 11 billion Web pages, and supports much of the company’s operations, including its online search and news services.
"I would venture to say we have one of the largest private clouds in the world," says Elissa Murphy, vice president of product development at Yahoo. The company defines the cloud as a series of shared services that nearly every one of its properties uses. For instance, "most of the data you see on a page is requested and pulled from our private cloud," Murphy says.
What drove Yahoo to operate much of its business in a cloud environment was the need for extreme agility and speed. "Any company that’s running at the scale that Yahoo runs at, supporting over 700 million users, needs to build new applications and serve up pages as quickly as possible to ensure the best user experience," Murphy says. She says the company’s private cloud delivers data at speeds not rivaled by public clouds — at least not yet.
One of the biggest challenges of managing the cloud is being able to quickly scale systems up and down. "For example, when a breaking news story occurs, we can quickly shunt workloads, moving lower priority workloads — batch processing, for example — off the servers and dedicate them to the news spike," Murphy says.
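Yahoo's scheduler is internal, but the preemption pattern Murphy describes, pushing low-priority batch work off servers when a spike hits, can be sketched with a priority queue. Job names and priority values here are illustrative:

```python
import heapq

def reallocate(running, needed, spike_priority):
    """running: list of (priority, job) pairs; lower number = lower priority.
    Preempt up to `needed` jobs whose priority is below the spike's,
    lowest-priority jobs first, and return the jobs to move off the servers."""
    heapq.heapify(running)  # min-heap: lowest-priority job surfaces first
    preempted = []
    while running and len(preempted) < needed and running[0][0] < spike_priority:
        preempted.append(heapq.heappop(running)[1])
    return preempted

jobs = [(1, "batch-etl"), (1, "batch-index"), (5, "search-frontend")]
print(reallocate(jobs, 2, spike_priority=9))  # → ['batch-etl', 'batch-index']
```

The higher-priority serving workload (here `search-frontend`) is never preempted; only batch work below the spike's priority is displaced.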
She says the company can quickly scale using technology developed internally, and adds that, in a large cloud environment, it’s critical to carefully manage this process.
Keeping data consistent across multiple regions, so that users get the same experience everywhere, is a major scalability challenge. "You have to essentially ensure each copy of the data around the world is consistent and ensure that a user’s privacy is maintained," Murphy says. "That introduces a large number of issues when you have products that span the world." One measure the company takes is to ensure that data is not copied to regions where there are different privacy restrictions in place.
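The privacy measure Murphy mentions, not copying data into regions with different privacy restrictions, amounts to filtering replication targets through a per-class policy before any copy is made. The region names and data classes below are illustrative, not Yahoo's actual policy model:

```python
# Hypothetical region/privacy model: each class of data may only be
# replicated to an allowed set of regions.
ALLOWED_REGIONS = {
    "default": {"us-east", "us-west", "eu-west", "apac"},
    "eu-personal-data": {"eu-west"},  # privacy rules confine this class to EU sites
}

def replication_targets(data_class: str, all_regions: set) -> set:
    """Return the regions a record may be copied to, honoring its privacy class."""
    allowed = ALLOWED_REGIONS.get(data_class, ALLOWED_REGIONS["default"])
    return all_regions & allowed

regions = {"us-east", "us-west", "eu-west", "apac"}
print(sorted(replication_targets("eu-personal-data", regions)))  # → ['eu-west']
```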
From a technology perspective, Yahoo has built some of the first and largest NoSQL data stores in the world, many of which the company uses to keep data consistent across regions.
To address scalability as well as system and application reliability, Yahoo uses Hadoop — to which the firm has contributed more than 70% of the code, Murphy says — extensively across its enterprise. The Apache Hadoop project develops open-source software for scalable, big data computing. The Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model. The framework is designed to scale up from single servers to thousands.
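Hadoop itself is written in Java, but the "simple programming model" the article refers to — map, shuffle, reduce — can be simulated in a few lines of Python with the classic word-count example:

```python
from collections import defaultdict

# A pure-Python simulation of Hadoop's map/shuffle/reduce phases (word count).
def map_phase(records):
    # Map: emit a (key, value) pair for every word in every input line.
    for line in records:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group all values by key, as Hadoop does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: combine each key's values into a single result.
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data computing", "big clusters of computers"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts["big"])  # → 2
```

In a real cluster the map and reduce functions run in parallel across many machines and the framework handles partitioning and failures; the programmer writes only the two small functions.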
Service speed is another critical issue for a company that lives on the Internet. "We actually have an SLA [service-level agreement] for speed," to guarantee how quickly Yahoo serves data up to its customers, Murphy says. "We are in the very low milliseconds consistently."
To address the need for speed, Yahoo uses some of the most advanced hardware available, including solid state drives with much faster read/write times than hard drives. "We also employ the use of caching throughout the infrastructure to ensure that data is always served as quickly as possible," Murphy says.
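The caching Murphy describes is a standard read-through pattern: serve hot objects from memory and fall back to slower storage only on a miss. A toy LRU sketch (class name, keys and capacity are illustrative):

```python
from collections import OrderedDict

class ReadThroughCache:
    """Toy read-through LRU cache: hot objects from memory, cold from slow storage."""
    def __init__(self, backing_store, capacity=2):
        self.store = backing_store
        self.capacity = capacity
        self.cache = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self.cache:
            self.cache.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self.cache[key]
        self.misses += 1
        value = self.store[key]          # slow path: SSD/disk/backend fetch
        self.cache[key] = value
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict least recently used
        return value

store = {"page:a": "<html>A</html>", "page:b": "<html>B</html>"}
cache = ReadThroughCache(store)
cache.get("page:a"); cache.get("page:a")
print(cache.hits, cache.misses)  # → 1 1
```

The second request for the same page is served from memory, which is how caching layers keep response times "in the very low milliseconds."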
Training and other personnel issues
Much about the cloud is new to people inside and outside IT organizations, and there’s a learning curve associated with moving to private cloud environments.
As a result, effective training is one of the challenges enterprises need to address when operating massive clouds for themselves and their customers.
"A lot of [technology] training and systems administration training issues go along with managing this type of solution," says BAE’s Becker. "People have to be well versed in knowing how virtualization systems work, how to configure systems for failover, how to do backups and provide business continuity."
BAE Systems has training programs that prepare its government agency customers for the cloud infrastructure as well as application support issues. Training in cloud-based offerings such as software-as-a-service (SaaS) is particularly important because the concept is still new to a lot of users and managers, Becker says.
The company trains its application developers on what end users might be looking for in the SaaS applications they will access via the cloud: self-service concepts, data access, security and so on.
"The end users are often soldiers or analysts — not IT people — who have to be able to use these systems," Becker says. "We have a whole training practice around how to build applications so that they’re intuitive for these users." In cases where BAE Systems provides the applications that end-users operate, the company provides training on the applications and the end-to-end system, Becker says.
Adjusting to the big cloud environment
The cloud requires different ways of thinking about computing and managing resources.
In 2009, the National Aeronautics and Space Administration (NASA) began development of a platform-as-a-service (PaaS) capability that developers could use to operate a variety of Web applications. From that effort evolved a separate project called Nebula, NASA’s private infrastructure-as-a-service (IaaS) development project, which focused on providing the scalable infrastructure needed to support PaaS-based services.
But it quickly became clear that the IaaS model could provide an attractive alternative to physical infrastructure for NASA projects that typically deploy their own servers.
"Today it can take months for NASA projects to procure new server resources, and even longer to install and configure them once they are received," says Raymond O’Brien, CTO for IT at NASA’s Ames Research Center. "Further, deploying physical servers sometimes requires specialized facilities and comes with the ongoing burden of operating, maintaining and replenishing physical assets."
IaaS provides an attractive alternative because equivalent server resources are instantly available and don’t come with the overhead of housing and supporting physical resources, O’Brien says.
Nebula has progressed through alpha and beta stages and most recently underwent a five-month evaluation by NASA’s Science Mission Directorate. During the alpha and beta stages, more than 250 NASA employees and contractors had Nebula accounts.
According to the NASA Web site, the Nebula team is working on the development and implementation of the Nebula Object Store, a Web services-based approach for storing and retrieving data created by applications that use object-oriented programming techniques.
Object Store will allow total storage capacity to be extended into the hundreds of petabytes if required, "something that is not out of the realm of future possibilities given NASA’s rate of data and information generation," the agency says.
NASA is now reviewing its private cloud project to determine its future path, O’Brien says.
"One of the biggest challenges for Nebula has been making the transition from a small development project focused on innovation to an agency service," O’Brien says. "There is a big difference between developing cloud software and operating a cloud."
Compounding this is the fact that the cloud model is still new "and there is no textbook on how to best position an on-demand private cloud service to gain broad adoption internally," O’Brien says.
Adding to the challenge is the fact that the cloud model might not be completely understood by internal users, and the organization must formulate a new business model for private cloud service and integrate the use of cloud services with existing policy and practices.
"Launching a new cloud service can be a lot of work," O’Brien says.