The Utility of Cloud Computing

November 15, 2011
Grazed from Bio-IT World. Author: Kevin Davies.

In many ways, cloud computing has become an ever-present commodity since Bio•IT World published a special issue on the subject (Nov 2009). In September 2011, we held our first standalone conference on the topic. Experts—users and vendors alike—gathered for two days of sharing insights and progress. The takeaway was that more and more users were comfortable with the flexibility, cost, and even security afforded by the cloud. And while Amazon’s omnipresent cloud capabilities were a recurring theme, what was even more impressive was the growing ecosystem of commercial and open-source initiatives offering a host of cloud-based services and applications…

Amylin Pharmaceuticals’ Steve Philpott was one of the first biotech CIOs to enthusiastically embrace “the big switch” to the cloud. “Our IT cost infrastructure is 50% less than when we started [in 2008]. My CFO really likes us!” Philpott said. “The box has disappeared… Do you care where your last Google search came from? No.”

But the cloud still raises doubts and concerns. A major issue, discussed below, concerns security and regulation. Another is whether the cloud can truly handle the masses of next-gen data that are being generated. “Can the cloud satisfy the requirement or can the requirement satisfy reality?” asks Eagle Genomics CEO Richard Holland. “Sometimes it’s not the cloud at all but the Internet—the whole concept of the network and trying to transfer that amount of data. Do you really need to be transferring that much data in the first place? I think people are generating so much data now, and they’re expecting to do with it what they always could, and that just might not be possible. You might have to rethink the whole paradigm of how this works.”

From A to Z

When it comes to the cloud, it boils down to infrastructure-as-a-service, and that means Amazon, “pretty much by default,” says Holland (see, “Eagle Eye on the Cloud”). That said, there are many other players, including Rackspace, Penguin, GoGrid, Nimbus, and Eucalyptus (open source).

Johnson & Johnson’s John Bowles said that the Amazon environment was “an eye-opener in terms of infrastructure-as-a-service… Seldom mentioned in big pharma is the ‘opportunity cost.’ If it takes six months to get a machine through CapEx, there’s no cost for that time.” J&J’s tranSMART knowledge base (see, “Running tranSMART for the Drug Development Marathon,” Bio•IT World, Jan 2010) went live two years ago. “Without the cloud environment, we’d still be arguing!” said Bowles.

According to Amazon Web Services (AWS) senior evangelist Jeff Barr, AWS is like electricity—a utility you pay for as you use it. “On demand, run by experts,” he said. The concept has roots tracing back to the 1960s (commodity computing, mass-produced computers), but adoption has matured, Barr said: “We’re past the innovators and early adopters—we’re at the early majority point.”

The advantages of the cloud are well known by now: no capital expenditure, pay-as-you-go, elastic capacity, and (in principle) improved time to market. “You can iterate and cycle more quickly. People love this elasticity,” said Barr. Trying to predict demand using a terrestrial data center is notoriously tricky, and inevitably leads to either an “opportunity cost” (compute power lying idle) or an inability to serve customers (demand exceeding supply). “The cloud can scale up or down, so the infrastructure matches the actual demand,” he said.

The Amazon cloud is spread across six geographic regions: the US (East and West Coast), Singapore, Tokyo, Europe, and one reserved for the U.S. federal government. Users have full control over where their data are stored and processed. “If you have regulatory issues and your data must remain in Europe, that’s fine,” he said. In addition to paying on demand, users can buy spot instances, whose price changes minute to minute. “This enables you to bid on unused Amazon EC2 capacity,” said Barr. “You can use this to obtain capacity very economically.” A recently published example comes from Peter Tonellato’s group at Harvard Medical School, which built a pipeline for clinical genome analysis in the cloud (see, “Genome Analysis in the Cloud”).
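To make the spot-market mechanics concrete, here is a minimal sketch of a bid for spare EC2 capacity using Amazon’s Python SDK (boto3, the modern successor to the boto library of this era); the AMI ID, instance type, and bid price are illustrative placeholders, not figures from the conference.

```python
# Minimal sketch: bidding on unused EC2 capacity as a Spot request.
# The AMI ID, instance type, and maximum price are hypothetical.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

response = ec2.request_spot_instances(
    SpotPrice="0.05",  # maximum price (USD/hour) we are willing to pay
    InstanceCount=1,
    LaunchSpecification={
        "ImageId": "ami-12345678",   # hypothetical AMI holding the pipeline
        "InstanceType": "m1.large",  # an instance type of the period
    },
)
request_id = response["SpotInstanceRequests"][0]["SpotInstanceRequestId"]
print("Spot request submitted:", request_id)
```

If the market price stays below the bid, the instance launches and runs at the (usually much lower) spot price; if demand pushes the price above the bid, the instance can be reclaimed, which is why spot capacity suits fault-tolerant batch work such as sequence analysis.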

Barr was excited about Amazon’s new relational database service (RDS), which allows users to launch a new database in a matter of seconds. Compare that, Barr said, to an in-house MySQL or Oracle database, which might take a year to get up and running. This could allow users to offload common administration tasks—OS and database upgrades, backups, etc.—to AWS. Other new initiatives include AWS CloudFormation and AWS Elastic Beanstalk (a simple way to deploy and manage applications). “We’ve moved from individual resources to entire apps, entire stacks, being programmable and scriptable,” said Barr.
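As a flavor of what “a database in seconds” means in practice, here is a minimal sketch of launching a managed MySQL instance on RDS from Python (boto3); every identifier and credential below is hypothetical.

```python
# Minimal sketch: provisioning a managed MySQL database on Amazon RDS.
# All names and credentials are placeholders, not values from the article.
import boto3

rds = boto3.client("rds", region_name="us-east-1")

rds.create_db_instance(
    DBInstanceIdentifier="sequencing-metadata",  # hypothetical instance name
    DBInstanceClass="db.m1.small",               # a small instance class
    Engine="mysql",
    AllocatedStorage=20,                         # storage in GiB
    MasterUsername="admin",
    MasterUserPassword="change-me-immediately",  # placeholder credential
)
# From here on, AWS handles the administration tasks Barr lists:
# OS and database upgrades, backups, and so on.
```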

A recent announcement by DNAnexus (see p. 41) revealed that a mirror of the NIH Short Read Archive of DNA sequence data will be hosted on the Google cloud. BioTeam principal Chris Dagdigian says the platform differs from Amazon’s. “Google has a more integrated platform that you run on top of, while AWS offers infrastructure elements that can be assembled and combined in many different ways,” said Dagdigian. “If the Google/DNAnexus collaboration delivers an easy-to-use compute platform with integrated ‘big data’ support, it could be quite interesting.”

Dagdigian remains a fervent backer of Amazon’s cloud infrastructure. “VMware—that’s not a cloud,” he told the crowd in La Jolla. “If you don’t have an API, or self-service, or email to humans, it’s not a cloud. If you have a 50% failure rate, it’s a stupid cloud.”

Dagdigian sees “a whole new world” when it comes to moving high-performance computing (HPC) into the cloud. Instead of building a generic system accessible by a few groups, one can now stand up dedicated, individually optimized resources for each HPC use case. When it comes to hybrid clouds and “cloud bursting,” where data movement is a pain, Dagdigian recommended a buy-don’t-build strategy. “I’m a fan of open source, but if you’re doing it, buy rather than build,” he said. Companies such as Cycle Computing (see, “Cycle Time”) and Univa UD have happy customers, he said.

“You can’t rewrite everything,” said Dagdigian. “Life sciences informatics has hundreds of codes that will never be rewritten. They’ll never change and will be needed for years to come.” The future of Big Data, he said, lies with tools such as Hadoop and MapReduce. “Small groups will write such apps, publish, open-source, and we’ll all plagiarize from them,” he said.
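To illustrate the pattern Dagdigian is pointing to, here is a small, self-contained Python sketch of the canonical MapReduce word count. The map and reduce steps are written as the plain functions a Hadoop Streaming job would distribute across many nodes; here they run locally for illustration.

```python
# MapReduce in miniature: a word count expressed as map and reduce steps.
# On a real Hadoop cluster these two functions would run as streaming
# tasks spread across the cluster; this local version shows the pattern.
from collections import defaultdict

def mapper(line):
    """Map step: emit a (word, 1) pair for every word in a line."""
    return [(word, 1) for word in line.split()]

def reducer(pairs):
    """Reduce step: sum the counts emitted for each key."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

lines = ["the cloud scales up", "the cloud scales down"]
pairs = [pair for line in lines for pair in mapper(line)]
print(reducer(pairs))  # {'the': 2, 'cloud': 2, 'scales': 2, 'up': 1, 'down': 1}
```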

While many users want high availability and resiliency, Dagdigian said that “HPC nerds” want speed. “I’d pay Amazon extra if they’d guarantee servers in the same rack,” he said. “HPC is an edge case for obscene-scale IaaS clouds. We need to engineer around this. We have to know where the bottlenecks are.”

Dagdigian couldn’t stop raving about MIT’s StarCluster (“It’s magical,” he said), Opscode Chef, and GlusterFS (now part of Red Hat), particularly for scale-out NAS on the cloud. He also called CODONiS a promising start-up for storage and security. As for where cloud computing slows down, Dagdigian insisted that Amazon itself was “no bottleneck—it’s always the server or the Internet.” “Direct-to-S3” file transfer solutions from Aspera (see, “Aspera’s fasp Track for High-Speed Data Delivery,” Bio•IT World, Nov 2010) also looked very promising.
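For context, moving a file “direct to S3” is a single call in Amazon’s Python SDK (boto3); the bucket, key, and file names below are hypothetical. Tools such as Aspera’s fasp effectively swap in a faster transport underneath the same idea for long-haul, high-volume transfers.

```python
# Minimal sketch: pushing a sequencing run straight to S3 object storage.
# Bucket, key, and file names are placeholders.
import boto3

s3 = boto3.client("s3")
s3.upload_file(
    Filename="run_42_reads.fastq.gz",   # hypothetical local file
    Bucket="my-sequencing-archive",     # hypothetical bucket
    Key="runs/run_42/reads.fastq.gz",   # object key inside the bucket
)
print("Upload complete")
```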

Leading the Charge

With the need to maximize IT efficiency, Amylin’s Steve Philpott led the charge to rethink IT under considerable financial pressure (see, “Amylin, Amazon, and the Cloud,” Bio•IT World, Nov 2009). “We have access to tremendous capabilities without having to build capabilities—10,000 cores, new apps cheap,” said Amylin’s Todd Stewart. “Some tools—manufacturing, ERP, etc.—will always be in the data center. But let’s find processes that will work in the cloud.”

With both campus data centers full, Amylin turned to pilot projects, including more than a dozen software-as-a-service deployments. Stewart noted that only one validated app was in the cloud so far. “In general, that’s still something we’re struggling with,” he said. CRM and call center apps have moved over, and Amylin has used Nirvanix cloud storage for two years—“a cubby hole on the Internet,” said Stewart. “We hope to go tapeless on backups shortly.”

“Chef is something we’ve had a look at,” says Eagle Genomics’ Holland. “As you scale up, it becomes less and less practical to use anything else, to be honest. You could write it yourself with a bunch of Python scripts, but someone else has done it, so why bother?!”

“We launched a 10,000-node cluster with one click [using Chef],” said Cycle Computing’s Andrew Kaczorek. Chris Brown, Opscode’s chief technical officer (and a co-developer of Amazon EC2), said: “We’re software architects and system administrators. We’ve run Amazon.com and Xbox Live. [We’re good at] automating infrastructure at scale.” The cloud, Brown said, is not necessarily cheaper than standard hosting. “Do you have the money, time, experience? What are you willing to pay for?” he asked. “We take the experience and plug it in for you. You want to manage 1,000 machines instantly. Google, Amazon—they have 100 people. Where can you find a team?”

Chef is many things: a library for configuration management, a systems integration platform, and an API for the entire infrastructure. “Our mantra is to enable you to construct or reconstruct your business from nothing but a source code repository, an application data backup, and machines,” said Brown.

“Big pharma is concerned about auditing: now we can show a snapshot of cookbooks running on a node. We can reproduce and launch a cluster that mirrors what it was at that date and time. It is very powerful,” said Kaczorek.

More Clouds

The cloud applications and offerings on show are too numerous to list. Appistry’s new Ayrris Bio offering is poised to make a big impact on life sciences organizations (see, “Appistry’s Fabric Computing”). Assay Depot hosts fully audited, private marketplaces for pharma clients (see, “Assay Depot’s Cloud Services”). Complete Genomics selected the Bionimbus open-source community cloud as a mirror for a major genome dataset. The University of Maryland’s CloVR is a portable virtual machine, launched on a desktop, that can manage additional resources on the cloud (EC2, academic clouds) for large-scale sequence analysis.

The San Diego Supercomputer Center is rolling out a cloud, “a private data storage cloud to enable the presentation and sharing of scientific data, with rental and owner pricing,” said SDSC’s Richard Moore. It will have an elastic design with an initial capacity of 2 petabytes, although the emphasis will be more on access and sharing. ChemAxon’s David Deng presented a collaboration with GlaxoSmithKline (GSK) on cloud computing, in which 13,500 potential anti-malarial drugs have been made freely available, hosted on EC2. Users access the data using ChemAxon’s Instant JChem database management tool, requiring no local software installation. Deng admitted that security was not a huge issue in this particular case, but added that his colleagues are eager to set up other collaborations.

Former Microsoft executive and BioIT Alliance founder Don Rule wrapped up the La Jolla conference. “Cloud computing is a very powerful enabler despite the hype,” said Rule. “It’s an important enabler for personalized medicine.” Ironically, Rule is experimenting with EC2 to run an encrypted HIPAA-compliant database.