Takeaways From the Facebook and Foursquare Outages

October 9, 2010 By David
Grazed from GigaOM.  Author:  Gary Orenstein.

In the last few weeks, lengthy outages at Facebook and Foursquare left site users anxious and industry watchers curious about how web applications will maintain greater uptime. The more we rely on these services, the less we tolerate any interruptions, even if the service is free.

With such large public audiences to keep informed, most web companies publish detailed post-mortems on site interruptions. These are usually an attempt to appease impatient customers and share a lesson the company learned. The write-ups by Facebook and Foursquare provide a few business and technical lessons that we should note.

On the business side:

Customers expect companies to publish a post-mortem after something goes wrong. Transparency is a winning strategy, and companies that issue a quick, genuine apology and explanation earn the trust of their users and retain them through good times and bad.

Maintaining a status blog goes a long way. Foursquare launched theirs, as a Tumblr microblog, after their recent outage. Between that and a Twitter support address, they feel they have enough coverage on customer communications channels. The company’s responsiveness on these channels helps maintain customer loyalty.

On the technical side:

Pushing performance with caching always leaves you a bit exposed. Web companies use all types of caching strategies to keep data in fast memory as opposed to slower disk. But having a second copy in cache adds an extra variable that can catch you by surprise. In the case of Facebook’s outage, they had a system that tried to repair inconsistencies between the persistent storage and cache, but this health check itself failed, causing more harm than good. Caching isn’t going away, but the rise of flash-based storage solutions offers an opportunity to simplify the tiers.
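To make the cache risk concrete, here is a minimal sketch of a repair pass that reconciles a cache with persistent storage. It is not Facebook's system; the names repair_cache and is_valid, and the use of plain dictionaries for the cache and the store, are assumptions made for illustration. The point it shows is the guard: a repair routine that blindly trusts the persistent copy can spread a bad value, so this one refuses to copy a stored value that fails validation.

```python
def repair_cache(cache: dict, store: dict, is_valid) -> list:
    """Overwrite cache entries that disagree with the store.

    Returns the keys that were left alone because the stored value
    was missing or failed validation (hypothetical policy).
    """
    unrepaired = []
    # Iterate over a snapshot so the cache can be updated in the loop.
    for key, cached in list(cache.items()):
        stored = store.get(key)
        if cached == stored:
            continue  # cache and store agree, nothing to do
        if stored is not None and is_valid(stored):
            cache[key] = stored  # store is trusted, refresh the cache
        else:
            # Do not propagate a missing or invalid stored value;
            # report it so a person or another system can look at it.
            unrepaired.append(key)
    return unrepaired


cache = {"a": 1, "b": 2, "c": 3}
store = {"a": 1, "b": 5, "c": -1}
skipped = repair_cache(cache, store, is_valid=lambda v: v >= 0)
```

In this example the entry for "b" is refreshed from the store, while "c" is reported instead of repaired because its stored value fails the check.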

Sharding stinks, but it is a necessary evil. In the case of Foursquare, their primary issue was an overloaded database instance. At large sites, companies spread large data sets across many smaller databases in a process called sharding. Sometimes they will split data by user ID: A through E in database 1, F through J in the next. When Foursquare found that one instance was overloaded with check-ins and tried to move some of the load off just that instance, the entire site was adversely affected.
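The user ID split described above can be sketched in a few lines. This is an illustration only, not Foursquare's or MongoDB's actual scheme; the five letter ranges and the function names shard_for and load_per_shard are assumptions. It also shows why one shard can end up overloaded: nothing in a fixed range split guarantees that activity is spread evenly.

```python
from collections import Counter

# Hypothetical ranges: A-E, F-J, K-O, P-T, U-Z (one shard per range).
SHARD_UPPER_BOUNDS = ["E", "J", "O", "T", "Z"]


def shard_for(user_id: str) -> int:
    """Return the index of the shard whose letter range covers user_id."""
    first = user_id[:1].upper()
    if not ("A" <= first <= "Z"):
        raise ValueError(f"user ID must start with a letter: {user_id!r}")
    for index, upper in enumerate(SHARD_UPPER_BOUNDS):
        if first <= upper:
            return index
    raise AssertionError("unreachable: Z is the last upper bound")


def load_per_shard(user_ids) -> Counter:
    """Count how many requests (one per user ID given) land on each shard."""
    return Counter(shard_for(user_id) for user_id in user_ids)


# Three of these four check-ins hit shard 0, one hits shard 1.
load = load_per_shard(["alice", "bob", "carol", "frank"])
```

If most active users happen to fall into one range, that shard carries most of the load, and rebalancing it means moving live data while the site is running.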

Foursquare uses MongoDB, a “document database” that also falls into the NoSQL category. One of the themes behind the NoSQL movement is scale, and while this kind of event should not be misinterpreted, it does raise the question of what might be needed to improve newer datastores. For those who want to dig one level deeper, there’s a post on the MongoDB user group from 10gen, the company behind MongoDB.

It is great to see leaders like Facebook and Foursquare pay this much attention to post-mortem write-ups. The collaborative attitude of the companies that support them helps as well. No doubt the web still has a lot of maturing to do, but these folks are showing us how to get there.