Hadoop: From Open Source Project to Big Data Ecosystem

October 3, 2010, by David
Grazed from GigaOM.  Author: Martin Hall.

The Hadoop hoopla is generating a growing number of announcements from more and more vendors. From startups to large established players, new products and partnerships confirm that a vibrant Big Data ecosystem is forming around Apache Hadoop.

However, the layers at which companies operate are frequently misunderstood, which leads to misconceptions about which vendors collaborate and which compete. Matt Asay’s recent article, for example, gave a good overview, but made the too-common assumption that companies compete simply because they both do something with Hadoop. Another article about the emerging big data “SMAQ” stack nailed the concept, but didn’t explain how to think about each vendor, where they fit in relation to one another and how they interconnect.

The lack of a clear definition of the vendor ecosystem may also be hurting adoption. A recent survey revealed that the largest challenge facing enterprises considering Hadoop is the steep learning curve. Not understanding the layers and categories that make up the vendor ecosystem makes that curve steeper.

As we roll up to the next big Hadoop event, it’s time to formalize the emerging Hadoop-based Big Data solution ecosystem as it is today and set the stage for where it is going. At Karmasphere, we group the categories of product into a fairly familiar ecosystem of three layers: Infrastructure, Data Management and Big Data Applications.

All three layers are necessary to handle the unprecedented volumes of data that need to be turned into meaningful results. Those results are extremely varied: finding a friend; getting a recommendation for a movie, a book or a professional colleague; understanding the spread or triggers of disease; detecting fraud; comprehending the behavior and buying patterns of a customer. The opportunity and competitive advantage inherent in burgeoning data is vast. The Big Data solution ecosystem already handles these and many other kinds of analytics today.

For the ecosystem to function, it first needs to address the different layers of Hadoop. As a collection of Apache Software Foundation projects, Hadoop includes the core MapReduce and distributed file system projects, along with Chukwa, HBase, Hive, Pig and ZooKeeper. Such a mesh of projects forms the essential cylinders of Hadoop-related power. To ultimately deliver value to today’s enterprise, each of these projects needs to work well within a private data center, a public cloud, or both simultaneously. Cooperation with existing data repositories is key, and tools are required that turn collections of open-source projects into highly valuable, prime-time manipulators of vast amounts of data.

Vendors within the ecosystem’s Infrastructure Layer provide the hardware and software for commodity processing and base storage. This layer includes cloud providers and private data centers. The Data Management Layer in the Hadoop ecosystem includes more of the “under-the-hood” components. In initial Hadoop deployments, this layer often contains some, many, or all of the Hadoop open-source projects, sourced either from the Apache Software Foundation itself or from companies such as Cloudera, IBM and Yahoo that make their own distributions of Hadoop.

Alternatively, a cloud provider like Amazon Web Services offers Hadoop as a service on top of its basic compute and storage infrastructure. Hadoop’s incorporation within the enterprise data fabric is being further enabled and validated by connectors to traditional RDBMS and data warehouse products, whose vendors are also becoming important members of the ecosystem. Admin tools, some coming from Hadoop distribution providers and some likely to come from other vendors, enable enterprise operators and administrators to provision and manage Hadoop clusters.

In combination, the Infrastructure and Data Management layers of the Hadoop ecosystem create a giant data power grid for the twenty-first-century enterprise. The Big Data Applications layer, where you find the software for developers and analysts, is what manages and exploits that power. Developer tools, software for technical analysts and business intelligence applications are all coming to market to help data professionals harness the power of Hadoop to transform, analyze and speedily deliver meaningful results. Karmasphere, Datameer, IBM and traditional business intelligence vendors are all adding value at this layer.
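To make the layering concrete, the three layers can be sketched as a small lookup table. This is only an illustration: the table holds just the vendors named in this article, the placements follow the descriptions above, and the names LAYERS, layers_of and shared_layers are hypothetical helpers rather than part of any product. Sharing a layer suggests only that two vendors might compete there. It does not prove that they do.

```python
# Illustrative sketch only: vendors and placements are taken from the
# article's own descriptions and are not an exhaustive market map.
LAYERS = {
    "Infrastructure": {"Amazon Web Services"},
    "Data Management": {
        "Apache Software Foundation",
        "Cloudera",
        "IBM",
        "Yahoo",
        "Amazon Web Services",
    },
    "Big Data Applications": {"Karmasphere", "Datameer", "IBM"},
}


def layers_of(vendor):
    """Return the sorted names of the layers in which a vendor appears."""
    return sorted(layer for layer, vendors in LAYERS.items() if vendor in vendors)


def shared_layers(vendor_a, vendor_b):
    """Return the sorted names of the layers in which both vendors appear."""
    return sorted(set(layers_of(vendor_a)) & set(layers_of(vendor_b)))


if __name__ == "__main__":
    pairs = [("Cloudera", "Karmasphere"), ("Cloudera", "IBM")]
    for a, b in pairs:
        overlap = shared_layers(a, b)
        if overlap:
            print(f"{a} and {b} overlap in: {', '.join(overlap)}")
        else:
            print(f"{a} and {b} share no layer; they are more likely partners")
```

In this sketch Cloudera and Karmasphere share no layer even though both "do something with Hadoop," which is the distinction drawn at the start of this article.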

Each new vendor product and service announcement adds more richness and choice for Hadoop users. This ecosystem is helping to accelerate adoption, to better support real-world business requirements, and to let enterprises harness the power of big data today and for many years down the road.