Why a DIY Big Data Stack Is a Better Option

September 4, 2010 | By David
Grazed from GigaOM.  Author: Shion Deysarkar.

Today, many conversations within the big data community are centered on the rise of a standard big data stack, which includes utilities like HDFS, HBase, and other increasingly popular applications. While settling on a standard big data stack is deeply important to the big data industry as a whole, I’m nonetheless questioning the operational and competitive consequences for companies that choose to buy into this standard without first considering the value of building their own proprietary solution.

Where We Are Today: Limited Choice

First, let me say that the Hadoop big data stack is an impressive achievement and something many respectable big data players have been rooting for. Players like Infochimps, Cloudera and Riptano have shown significant support and found success. With vendors like these as evidence, it’s clear the Hadoop stack is great for optimal data processing in many scenarios.

Before people get up in arms, I’ll say that there really is no perfect data solution that works for every situation. This realization led us to build our own solution.

The Road Less Traveled

At 80legs, we chose the proprietary route, a far more appealing option than most would acknowledge today, for several reasons. For one, we have a unique need to collect large volumes of data when crawling millions of websites, rather than simply process it. Our stack has dramatically lower costs as a result of our unique situation, and unlike the standard big data stack, ours doesn’t need to support storing large volumes of data. Instead, each node stores data in the same place it is processed.

1. A Unique Need for Collecting, Rather than Processing, Big Data: 80legs is a web-crawling service. While most of the big data world is focused on how to process data, we (and our customers) worry about how to get that data in the first place. Across all 80legs users, we’re crawling anywhere between 15 and 30 million web pages per day, which means downloading 10 TB of data every month, and we want to be doing at least 5x that by the end of the year. The standard big data stack isn’t ready to scale in this manner. Rather than spending a fixed amount on bandwidth for this data influx, our stack shifts the bandwidth work to the nodes: we have about 50,000 computers contributing their excess bandwidth. You can forget AWS if you want to do this much crawling; the bandwidth cost you’ll incur will be too high.
2. Dramatically Lower Costs: Shifting the cost of bandwidth to the nodes is a bigger deal than you might think. The standard big data stack would have you pay for bandwidth in terms of, well, bandwidth. Our system can scale to more CPU time and bandwidth than most clouds, instantly; that’s the big payoff. Instead of paying for the volume of data collected, we pay for the time spent getting that data. In other words, we’ve set up a nice little arbitrage. While the standard big data stack has made huge strides in making big data more accessible to everyone, it will always fall short of our stack when it comes to the cost of collecting data.
3. No Need to Store Big Data: We actually don’t store that much data. Because 80legs users can filter their data on the nodes, they return the minimum amount of data in their result sets. The processing (or reduction, pardon the pun) is done on the nodes, so actual result sets are very small relative to the size of the input set.
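The node-side filtering described above can be sketched roughly like this. It is a minimal illustration of the idea, not 80legs’ actual code; the `crawl_on_node` function, the toy page store, and the filter predicate are all hypothetical.

```python
# Minimal sketch of node-side filtering during a crawl (hypothetical;
# not 80legs' actual implementation). Each node fetches pages with its
# own bandwidth, applies the user's filter locally, and ships back only
# the small matching subset instead of the raw pages.

def crawl_on_node(urls, fetch, user_filter, extract):
    """Run on a single node: download, filter, and reduce in place."""
    results = []
    for url in urls:
        page = fetch(url)                  # download on the node
        if user_filter(page):              # filter locally
            results.append(extract(page))  # keep only the minimal fields
    return results                         # only the reduced set leaves the node

# Toy usage: keep only pages mentioning "big data", return just the URL.
pages = {
    "http://a.example": "<title>Big Data Stacks</title> notes on big data",
    "http://b.example": "<title>Cooking</title> weeknight recipes",
}
out = crawl_on_node(
    pages,
    fetch=lambda u: (u, pages[u]),
    user_filter=lambda p: "big data" in p[1].lower(),
    extract=lambda p: p[0],
)
print(out)
```

The key point is that the filter and extraction run where the page was downloaded, so only the small result set crosses the network.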

For all of you evaluating whether to do it yourself, keep in mind that sometimes what’s available on the market isn’t going to solve every one of your problems, especially as you start to throw around petabytes of data.
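As a quick back-of-the-envelope check on the figures quoted above (15 to 30 million pages per day against 10 TB per month), the implied average page size works out to roughly 17 KB; the midpoint traffic figure and the 30-day month are my assumptions:

```python
# Sanity-check the crawl figures quoted in the article.
pages_per_day = 20e6        # assumed midpoint of the 15-30 million range
days_per_month = 30         # assumed month length
tb_per_month = 10           # stated monthly download volume

pages_per_month = pages_per_day * days_per_month        # 600 million pages
kb_per_page = (tb_per_month * 1e9) / pages_per_month    # 1 TB = 1e9 KB
print(f"~{kb_per_page:.1f} KB per page on average")
```

That is a plausible size for a fetched web page circa 2010, which suggests the stated numbers are internally consistent.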

Realistic Roadblocks

Now, we wouldn’t be telling the whole truth if we neglected to point out two big potential pitfalls in building your own big data stack.

First, with a custom-built stack, there is almost no support. Bugs and problems tend to be unique to your system. At 80legs, the only experts we can call upon are ourselves, and as a business we have to be ready to shift into problem-solving mode at any moment because we can’t rely on anyone else. We’re on our own.

We’ve spent a lot of time debugging things for which there are no support forums, and it can be quite frustrating when you see your open source friends resolve their problem in minutes by logging on to IRC.

Second, building features into your system over time can create a lot of moving parts: far more than even the most decked-out big data stack. You also won’t have access to tools that manage those components for you, such as Cloudera’s CDH.

Significant Competitive Advantage

We had strong reasons to take our own road, and the downsides can be maddening at times, but there are still key advantages you get from doing it yourself that will make a significant impact on the success of your business. One advantage is optimization: an “off-the-shelf” system has generalities built into it that can’t be tuned to fit your needs. The opportunity cost of going “standard” is a slew of competitive advantages.

At 80legs, our back-end is optimized by hand, literally down to the byte level. This is no small thing: eking out a 2-4 percent improvement in your data-processing throughput matters at this scale. If you don’t believe us, look to Facebook, Twitter and Google, all of whom have built their own systems, even if some eventually spun out as open-source solutions (e.g., Cassandra). Facebook chose to build its own solutions in certain areas because doing so gives it a technology advantage, which in turn bolsters its dominance in the market.

The other benefit of going our own way is a sustainable competitive advantage over time. An important lesson I’ve learned is that, contrary to popular thinking, most true competitive advantages are operational and cultural ones. If you drive the same standard vehicle as everyone else, you’re not going to get the performance of a custom-fitted ride. In the world of big data, your stack is the most important component of your operational strategy. So do you want to be just as good as everyone else, or better?

Here’s an important side note: If you’re looking at this from a startup’s perspective, there’s another consideration. Which company is more likely to be acquired? One with a unique, well-performing back-end “secret sauce”, or one with a standard, run-of-the-mill stack?  Where is your IP?

People have often asked me whether I would have used the standard big data stack if 80legs were built today, and the answer is still a resounding “No.” That said, if you’re not trying to compete on data-processing throughput or other performance metrics, I recommend you use the standard big data stack. You’ll save time and money and have more support resources available in the community.

However, if you want to compete on operational performance in your market, you need to seriously consider building a proprietary solution in-house. Frankly, provided you have the chops to outperform the standardized stack, why wouldn’t you pursue a technology advantage that provides sustained business value?