Why a DIY Big Data Stack Is a Better Option

September 4, 2010 | By David
Grazed from GigaOM.  Author: Shion Deysarkar.

Today, many conversations within the big data community are centered on the rise of a standard big data stack, which includes utilities like HDFS, HBase, and other increasingly popular applications. While settling on a standard big data stack is deeply important to the big data industry as a whole, I’m nonetheless questioning the operational and competitive consequences for companies that choose to buy into this standard without first considering the value of building their own proprietary solution.

Where We Are Today: Limited Choice

First, let me say that the Hadoop big data stack is an impressive achievement and something many respectable big data players have been rooting for. Players like Infochimps, Cloudera and Riptano have shown significant support and found success. With vendors like these as evidence, it’s clear the Hadoop stack is great for optimal data processing in many scenarios.

Before people get up in arms, I’ll say that there really is no perfect data solution that works for every situation. This realization led us to build our own solution.

The Road Less Traveled

At 80legs, we chose the proprietary route, a far more appealing option than most would acknowledge today, for several reasons. For one, we have a unique need to collect large volumes of data when crawling millions of websites, rather than simply process it. Our stack has dramatically lower costs as a result of our unique situation, and unlike the standard big data stack, ours doesn’t need to support storing large volumes of data. Instead, each node stores data in the same place it is processed.

1. A Unique Need for Collecting, Rather than Processing, Big Data: 80legs is a web-crawling service. While most of the big data world is focused on how to process data, we (and our customers) worry about how to get that data in the first place. Across all 80legs users, we’re crawling anywhere between 15 and 30 million web pages per day, which means downloading 10 TB of data every month, and we want to be doing at least 5x that by the end of the year. The standard big data stack isn’t ready to scale in this manner. Rather than spending a fixed amount on bandwidth for this data influx, our stack shifts the bandwidth work to the nodes: we have about 50,000 computers contributing their excess bandwidth. You can forget AWS if you want to do this much crawling; the bandwidth cost you’ll incur will be too high.
2. Dramatically Lower Costs: Shifting the cost of bandwidth to the nodes is a bigger deal than you might think. The standard big data stack would have you pay for bandwidth in terms of, well, bandwidth. Our system can scale to more CPU time and bandwidth than most clouds, instantly; that’s the big payoff. Instead of paying for the volume of data collected, we pay for the time spent getting that data. In other words, we’ve set up a nice little arbitrage. While the standard big data stack has made huge strides in making big data more accessible to everyone, it will always fall short of our stack when it comes to the cost of collecting data.
3. No Need to Store Big Data: We actually don’t store that much data. Because 80legs users can filter their data on the nodes, they return the minimum amount of data in their result sets. The processing (or reduction, pardon the pun) is done on the nodes, so actual result sets are very small relative to the size of the input set.
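The node-side filtering described above can be sketched roughly like this. It is a minimal illustration of the idea, not 80legs’ actual code; the `crawl_on_node` function, the toy page store, and the filter predicate are all hypothetical.

```python
# Minimal sketch of node-side filtering during a crawl (hypothetical;
# not 80legs' actual implementation). Each node fetches pages with its
# own bandwidth, applies the user's filter locally, and ships back only
# the small matching subset instead of the raw pages.

def crawl_on_node(urls, fetch, user_filter, extract):
    """Run on a single node: download, filter, and reduce in place."""
    results = []
    for url in urls:
        page = fetch(url)                  # download on the node
        if user_filter(page):              # filter locally
            results.append(extract(page))  # keep only the minimal fields
    return results                         # only the reduced set leaves the node

# Toy usage: keep only pages mentioning "big data", return just the URL.
pages = {
    "http://a.example": "<title>Big Data Stacks</title> notes on big data",
    "http://b.example": "<title>Cooking</title> weeknight recipes",
}
out = crawl_on_node(
    pages,
    fetch=lambda u: (u, pages[u]),
    user_filter=lambda p: "big data" in p[1].lower(),
    extract=lambda p: p[0],
)
print(out)
```

The key point is that the filter and extraction run where the page was downloaded, so only the small result set crosses the network.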

For all of you evaluating whether to do it yourself, keep in mind that sometimes what’s available on the market isn’t going to solve every one of your problems, especially as you start to throw around petabytes of data.
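As a quick back-of-the-envelope check on the figures quoted above (15 to 30 million pages per day against 10 TB per month), the implied average page size works out to roughly 17 KB; the midpoint traffic figure and the 30-day month are my assumptions:

```python
# Sanity-check the crawl figures quoted in the article.
pages_per_day = 20e6        # assumed midpoint of the 15-30 million range
days_per_month = 30         # assumed month length
tb_per_month = 10           # stated monthly download volume

pages_per_month = pages_per_day * days_per_month        # 600 million pages
kb_per_page = (tb_per_month * 1e9) / pages_per_month    # 1 TB = 1e9 KB
print(f"~{kb_per_page:.1f} KB per page on average")
```

That is a plausible size for a fetched web page circa 2010, which suggests the stated numbers are internally consistent.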

Realistic Roadblocks

Now, we wouldn’t be telling the whole truth if we neglected to point out two big potential pitfalls in building your own big data stack.

First, with a custom-built stack, there is almost no support. Bugs and problems tend to be unique to your system. At 80legs, the only experts we can call upon are ourselves, and as a business we have to be ready to shift into problem-solving mode at any moment because we can’t rely on anyone else. We’re on our own.

We’ve spent a lot of time debugging things for which there are no support forums, and it can be quite frustrating when you see your open source friends resolve their problem in minutes by logging on to IRC.

Second, building features into your system over time can create a lot of moving parts: far more than even the most decked-out big data stack. You also won’t have access to tools that manage those components for you, such as Cloudera’s CDH.

Significant Competitive Advantage

We had strong reasons to take our own road, and the downsides can be maddening at times, but there are still key advantages you get from doing it yourself that will make a significant impact on the success of your business. One advantage is optimization: an “off-the-shelf” system has generalities built into it that can’t be tuned to fit your needs. The opportunity cost of going “standard” is a slew of competitive advantages.

At 80legs, our back-end is optimized by hand, literally down to the byte level. This is no small thing: eking out a 2-4 percent improvement in your data-processing throughput matters at this scale. If you don’t believe us, look to Facebook, Twitter and Google, all of whom have built their own systems, even if some eventually spun out as open-source solutions (e.g., Cassandra). Facebook chose to build its own solutions in certain areas because doing so gives it a technology advantage, which in turn bolsters its dominance in the market.

The other benefit of going our own way is a sustainable competitive advantage over time. An important lesson I’ve learned is that, contrary to popular thinking, most true competitive advantages are operational and cultural ones. If you drive the same standard vehicle as everyone else, you’re not going to get the performance of a custom-fitted ride. In the world of big data, your stack is the most important component of your operational strategy. So do you want to be just as good as everyone else, or better?

Here’s an important side note: If you’re looking at this from a startup’s perspective, there’s another consideration. Which company is more likely to be acquired? One with a unique, well-performing back-end “secret sauce”, or one with a standard, run-of-the-mill stack?  Where is your IP?

People have often asked me whether I would have used the standard big data stack if 80legs were built today, and the answer is still a resounding “No.” That said, if you’re not trying to compete on data-processing throughput or other performance metrics, I recommend you use the standard big data stack. You’ll save time and money and have more support resources available in the community.

However, if you want to compete on operational performance in your market, you need to seriously consider building a proprietary solution in-house. Frankly, provided you have the chops to outperform the standardized stack, why wouldn’t you pursue a technology advantage that provides sustained business value?