Public-Cloud Lessons Learned After Dropbox Outage

January 15, 2014 By David
Object Storage

Grazed from NewsFactor. Author: Jennifer LeClaire.

The sky didn’t fall, but the cloud was dark over the weekend as Dropbox faced service disruptions that angered many users. The company reported its online storage service went down on Friday evening during scheduled maintenance and was back up and running about three hours later, with core service fully restored by 4:40 p.m. PT on Sunday. So what happened? And what can we learn from the outage? Akhil Gupta, head of infrastructure at Dropbox, offered his insights in a blog post Sunday.

Gupta said Dropbox relies on thousands of databases to run, and each database has one master and two slave machines for redundancy. The company performs full and incremental data backups and stores them in a separate environment. The trouble came during an operating system upgrade to some of Dropbox’s machines…

What Really Happened?

"During this process, the upgrade script checks to make sure there is no active data on the machine before installing the new OS," Gupta said. "A subtle bug in the script caused the command to reinstall a small number of active machines. Unfortunately, some master-slave pairs were impacted, which resulted in the site going down."…
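To make the kind of check Gupta describes concrete, here is a minimal sketch in Python. It is not Dropbox's script, whose details the post does not give; the Machine record, its fields, and the function names are all hypothetical. The idea it illustrates is a guard that fails closed: a machine is queued for reinstall only if every available signal agrees it is inactive.

```python
from dataclasses import dataclass


@dataclass
class Machine:
    # Hypothetical inventory record; field names are illustrative only.
    name: str
    role: str              # "master", "replica", or "idle"
    data_bytes: int        # bytes of live data found on disk
    open_connections: int  # current client connections


def safe_to_reinstall(m: Machine) -> bool:
    """Return True only if every signal says the machine is inactive."""
    return m.role == "idle" and m.data_bytes == 0 and m.open_connections == 0


def plan_reinstall(machines):
    """Split machines into (to_reinstall, skipped) without touching any host."""
    to_reinstall, skipped = [], []
    for m in machines:
        (to_reinstall if safe_to_reinstall(m) else skipped).append(m.name)
    return to_reinstall, skipped


# Assumed example fleet, made up for illustration.
fleet = [
    Machine("db-001", "master", 5_000_000, 12),
    Machine("db-002", "idle", 0, 0),
    Machine("db-003", "idle", 4096, 0),  # labeled idle but still holds data
]
print(plan_reinstall(fleet))
```

Producing a plan first, rather than reinstalling inside the loop, means the list of affected machines can be reviewed before anything destructive runs.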

Read more from the source @ http://www.newsfactor.com/story.xhtml?story_id=032000J3LVQ8