Spanning’s Backup Service in the Cloud

November 28, 2011 By David
Grazed from GigaOM.  Author: Stacey Higginbotham.

Austin (Tex.)-based start-up Spanning has embraced the concept of cloud computing so much that its product is a backup service for Google Apps (GOOG), completely hosted and run from Amazon Web Services (AMZN). The idea of backing up one cloud service via another intrigued me, so I asked Mike Pav, Spanning’s vice-president of engineering, how he does it.

The company charges people or businesses $30 a year to back up Google Apps, including e-mail and documents. That covers customers who somehow delete or lose those files. Google will support users if it loses their data, but it won’t search for your files if you mess up…

Chief Executive Officer Charlie Wood is confident that Google won’t offer extended services such as backup, so he’s sticking with this. Still, he’s looking for new lines of business as the company grows. Spanning is seeking its next round of funding after having raised a $2 million Series A round in April.

Building a cloud-based backup for a cloud service requires a lot of planning for worst-case scenarios. Creating a backup service in Amazon Web Services hasn’t been done, says Pav, as he explains some techniques he’s developed. “For example, a single point of failure for us was our database, but we just finished up a big project to partition our database,” Pav says. “We have to focus on the path and not the destination, because as far as scalability is concerned, we’ll never be done. That’s our real barrier to entry.”

Spanning adds terabytes of storage each month, using Amazon because it makes automatic scaling seamless. “It would be terrible if we had to rack our own drives into an array to deal with that,” Pav says. Spanning stores all the content on S3 for high reliability, but getting data into S3 can be slow, so Spanning writes to it in parallel, which offsets the slow transfers while preserving scalability and reliability.
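To make the parallel-access idea concrete, here is a minimal Python sketch. It is not Spanning's code: the article does not say how the uploads are split, so the chunking scheme, the `upload_in_parallel` helper, and the in-memory `fake_store` standing in for S3 are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for S3: maps a chunk index to its bytes.
fake_store = {}


def put_chunk(index, chunk):
    """Pretend to upload one chunk. A real version would call a storage client."""
    fake_store[index] = chunk


def split_into_chunks(data, chunk_size):
    """Cut the payload into fixed-size pieces; the last one may be shorter."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def upload_in_parallel(data, chunk_size, upload, max_workers=8):
    """Upload all chunks concurrently and return how many were sent."""
    chunks = split_into_chunks(data, chunk_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(upload, index, chunk)
            for index, chunk in enumerate(chunks)
        ]
        # Wait for every upload and surface any exception it raised.
        for future in futures:
            future.result()
    return len(chunks)
```

Because each chunk is independent, a slow transfer delays only its own worker thread, which is the general reason parallel writes help when a single stream is slow.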

Relentless Backup

Spanning uses Amazon SQS to queue work for a pool of virtual servers that grows and shrinks based on load. Pav’s team has set up Spanning’s application to track the incoming flow of data to EC2 and, each time the system is about to back up new content, to check whether the EC2 instance is about to shut down. If it is, the in-progress backup re-queues its remaining work so another server in the pool can pick it up. This way, the backup doesn’t have to start again.
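The check-then-re-queue pattern can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Spanning's implementation: a standard-library `queue.Queue` stands in for SQS, and the `BackupJob` class, its `next_index` checkpoint, and the `shutting_down` callback are hypothetical names.

```python
import queue
from dataclasses import dataclass


@dataclass
class BackupJob:
    """Hypothetical job record: what to back up and how far we got."""
    user: str
    items: list
    next_index: int = 0  # checkpoint: first item not yet backed up


def run_worker(jobs, backed_up, shutting_down):
    """Process one job; return True if it finished, False if it was re-queued."""
    job = jobs.get()
    while job.next_index < len(job.items):
        # Before each new item, ask whether this server is about to go away.
        if shutting_down():
            jobs.put(job)  # hand the remaining work to another worker
            return False
        backed_up.append(job.items[job.next_index])
        job.next_index += 1
    return True
```

The job carries its own checkpoint, so whichever worker picks it up next resumes at the first unfinished item instead of repeating completed work.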

This is important when dealing with sets of data that can be huge. Pav says Amazon offers various models for queue management, but simplicity and scalability work best for Spanning. “When you’re dealing with large data sets for a large number of users, you can’t afford to do anything twice.”

Spanning uses Amazon Relational Database Service (RDS) for its persistent database storage, although RDS limits how much data Spanning can store and how much throughput it can support on any single database. Pav admits that this limits his partitioning strategies, but it enables him to focus on building his own data store, rather than on support.

“We want to get out of the business of spending time managing these things. We can solve this problem at the application’s architectural level to make sure it scales,” he says. “RDS may not be the highest-performance option, but we are able to reduce investment into something that’s not core to our business and by making good application-level architectural decisions, we can render the RDS performance issue moot.”
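One common way to work around per-database limits at the application level is to route each customer to one of several databases. The article does not describe Spanning's partitioning scheme, so the following Python sketch is only an assumption: it uses a stable hash of a user ID, and the shard names and the `shard_for` function are invented for illustration.

```python
import hashlib

# Hypothetical database names; each would be a separate RDS instance.
SHARDS = ["backup-db-0", "backup-db-1", "backup-db-2", "backup-db-3"]


def shard_for(user_id, shards=SHARDS):
    """Map a user ID to one shard, the same way on every server."""
    # SHA-256 is stable across processes, unlike Python's built-in hash().
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % len(shards)
    return shards[bucket]
```

With routing like this, no single database has to hold every customer's data or absorb all of the traffic. A simple modulo scheme does force data to move when the shard count changes, which is one reason real systems often use a lookup table or consistent hashing instead.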

Pav says Amazon has not only changed the economics of building an IT service, it has also helped make his product better and faster at lower cost. He notes that the service’s reliability lets him deploy new code when features are ready, often in the middle of the day, when his team is fresh. This is a big shift from the old practice of waiting until late at night, when fewer users would be affected by a disruption. Then again, with a large customer base spread around the globe, there is no longer a middle of the night.