Why Hadoop-like ‘Dryad’ Could Be Microsoft’s Big Data Star

December 22, 2010 Off By David
Grazed from GigaOM.  Author: Derrick Harris.

Microsoft’s HPC division opened up the company’s Dryad parallel-processing technologies as a Community Technology Preview (CTP) last week, a first step toward giving Windows HPC Server users a production-ready big data tool designed specifically for them. The available components — Dryad, DSC and DryadLINQ – appear to be an almost part-for-part comparison with the base Hadoop components (Hadoop MapReduce, Hadoop Distributed File System and the SQL-like Hive programming language) and hint that Microsoft wants to offer Windows/.NET shops their own stack on which to write massively parallel applications. Dryad could be a rousing success, in part because Hadoop — which is written in Java — isn’t ideally suited to run atop Windows or support .NET applications.

When Microsoft broke into the high-performance computing market in 2005 with its Windows Compute Cluster Server, it affirmed the idea that, indeed, HPC was taking place on non-Linux boxes. Certainly, given the prevalence of large data volumes across organizations of all types, there’s equal, if not greater, demand for Hadoop-like analytical tools even in Windows environments. Presently, however, Apache only supports Linux as a production Hadoop environment; Windows is development-only.  There are various projects and tools floating around for running Hadoop on Windows, but they exist outside the scope of Apache community support.  A Stack Overflow contributor succinctly summed up the situation in response to a question about running Hadoop on Windows Server:

From the Hadoop documentation:

‘Win32 is supported as a development platform. Distributed operation has not been well tested on Win32, so it is not supported as a production platform.’

Which I think translates to: ‘You’re on your own.’

Then there is the Windows Azure angle, where Microsoft appears determined to compete with Amazon Web Services. Not only has it integrated IaaS functionality in Windows Azure, but, seemingly in response to AWS courting the HPC community with Cluster Compute Instances, Microsoft also recently announced free access to Windows Azure for researchers running applications against the NCBI BLAST database, which currently is housed within Azure. Why not counter Elastic MapReduce – AWS’s Hadoop-on-EC2 service – with Dryad applications in Azure?

SD Times actually reported in May that Microsoft was experimenting with supporting Hadoop in Windows Azure, but that capability hasn’t arrived yet. Perhaps  that’s because Microsoft has fast-tracked Dryad for a slated production release in 2011. As the SD Times article explains, Hadoop is written in Java, and any support for Hadoop within Windows Azure likely would be restricted to Hadoop developers. That’s not particularly Microsoft-like, which certainly would prefer to give additional reasons to develop in .NET, not Java. Dryad would be such a reason.

What might give the Microsoft developer community even more hope is the Dryad roadmap that Microsoft presented in August. ZDNet’s Mary Jo Foley noted a variety of Dryad subcomponents that will make Dryad even more complete, including a job scheduler codenamed “Quincy.”

There’s no guarantee that the production version of Dryad will resemble the current iteration too closely. As Microsoft notes in the blog post announcing the Dryad CTP, “The DryadLINQ, Dryad and DSC programming interfaces are all in the early phases of development and might change significantly before the final release based on your feedback.” A prime example of this is that Dryad has only been tested on 128 individual nodes. In the Hadoop world, Facebook is running a cluster that spans 3,000 nodes (sub req’d). Dryad has promise as Hadoop for the Microsoft universe, but, clearly, there’s work to be done.