Big data: You’ll have it, but can you handle it?
April 25, 2011
In 1999, I was called in to troubleshoot a customer’s client/server application that had recently failed a government acceptance test by taking more than 20 minutes to complete queries during stress testing. After months of intense software redesign that included overcoming pushback from a recalcitrant software development team, we were able to increase query performance by 2,000 percent, and the system subsequently passed its acceptance test.
That experience taught me two hard-won lessons: First, even though I am a staunch advocate of Donald Knuth’s admonition that “premature optimization is the root of all evil,” performance matters. And second, scalability is hard to achieve.
Or at least it used to be. Cloud computing is changing that. It is making scalability easier and enabling a proportional increase in the size and scope of data that organizations can process. These two ramifications, instant scalability and the advent of “big data,” are reshaping the computing and information management landscapes. Previously, big data would significantly degrade the scalability of an application, and programmers would therefore introduce throttling mechanisms or look to Moore’s law to bail them out of performance problems. But now, you can have your cake (big data) and eat it, too (scalability)! As we will see, nowhere is the need for processing big data more urgent than in the U.S. government.
For the Defense Department, the ability to rapidly exploit huge volumes of data can mean the difference between life and death. Thus, the Army recently announced it had deployed its first tactical cloud to Afghanistan. The Health and Human Services Department is funding grants to sift the huge volumes of data expected to follow adoption of electronic health care records. Meanwhile, the National Oceanic and Atmospheric Administration and Environmental Protection Agency routinely create huge quantities of sensor data as they monitor the physical environment.
Agency by agency, from the Securities and Exchange Commission to the Justice and Homeland Security departments and every other large government organization, volumes of data are increasing exponentially. These agencies are already struggling with big data, want to use big data to improve analysis and make better decisions, or both.
So what is big data? First, it is not your father’s data. Some examples are cell phone geolocation data, sensor data, surveillance data, Wikipedia text, social media status updates and many other streams of continuous data. These streams might not be record- or document-oriented. Instead, they are often transient or might be aggregated from multiple sources.
In fact, I am beginning to see big data as an emerging data type with a unique set of properties and challenges. Big data has different meta data and processing requirements, as is evident in parallel processing algorithms such as map/reduce.
With big data, all the common meta data attributes for accuracy, lineage, security and privacy take on increased importance because of the volume of data in question. Meanwhile, parallelization is a key part of processing big data to enable useful results in a reasonable time frame. Along with parallelization, visualization and summarization are core processing techniques for big data.
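To make the map/reduce idea concrete, here is a minimal sketch of the pattern applied to word counting, which doubles as a simple form of summarization. It is an illustration only: the function names (`map_chunk`, `merge_counts`, `word_count`) and the sample status updates are hypothetical, and a local thread pool stands in for the cluster of machines that a real map/reduce framework would use.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def map_chunk(lines):
    """Map step: count words within one independent chunk of the input."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def merge_counts(left, right):
    """Reduce step: combine two partial results into one."""
    return left + right


def word_count(lines, chunk_size=2, workers=4):
    """Split the input into chunks, map them concurrently, then reduce."""
    chunks = [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]
    # A thread pool only shows the shape of the computation; a real
    # deployment would spread the chunks across many machines.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_chunk, chunks))
    return reduce(merge_counts, partials, Counter())


# Hypothetical sample input standing in for a stream of status updates.
status_updates = [
    "big data is here",
    "big data needs parallel processing",
    "sensor data streams in continuously",
]
totals = word_count(status_updates)
print(totals.most_common(1))  # [('data', 3)]
```

The point of the design is that each chunk is processed without reference to any other, so adding workers shortens the map step, and the reduce step only has to combine small partial summaries.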
Given the scale of these datasets, a processing mistake or unauthorized spillage of big data means big trouble. In addition to increased prudence, we must add meta data attributes that are unique to big data, such as granularity, degree of aggregation, use of heuristics and the degree of preprocessing. Other meta data attributes that might apply include time span, geospatial information, transience and transactional capabilities.
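One way to picture these attributes is as a record attached to each dataset. The sketch below is purely illustrative: the class name `BigDataMetadata`, its field names and types, and the sample sensor feed are all assumptions made for the example, not an established schema or standard.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional, Tuple


@dataclass
class BigDataMetadata:
    """Hypothetical metadata record for one big data stream or dataset."""
    # Common attributes that take on increased importance at scale.
    source: str
    lineage: List[str]            # upstream datasets, oldest first
    security_label: str
    # Attributes described above as unique to big data.
    granularity: str              # e.g. "per-reading" or "hourly"
    aggregation_level: int        # 0 means raw, unaggregated data
    uses_heuristics: bool
    preprocessing_steps: List[str] = field(default_factory=list)
    # Other attributes that might apply.
    time_span: Optional[Tuple[datetime, datetime]] = None
    bounding_box: Optional[Tuple[float, float, float, float]] = None
    transient: bool = False

    def is_raw(self) -> bool:
        """True when the data has been neither aggregated nor preprocessed."""
        return self.aggregation_level == 0 and not self.preprocessing_steps


# Hypothetical example: a raw, transient environmental sensor feed.
feed = BigDataMetadata(
    source="air-quality-sensors",
    lineage=[],
    security_label="public",
    granularity="per-reading",
    aggregation_level=0,
    uses_heuristics=False,
    time_span=(datetime(2011, 4, 1), datetime(2011, 4, 25)),
    transient=True,
)
print(feed.is_raw())  # True
```

Recording the degree of aggregation and preprocessing explicitly lets downstream analysts tell at a glance whether they are looking at raw readings or at a derived summary.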
For government IT managers and CIOs, big data is at the doorstep. Now is the time to rethink your data architectures to accommodate this new type of data. Big data will hold great promise or peril based on your ability to understand and take advantage of it.