Object Storage: Where Cloud Computing and Big Data Meet

May 21, 2012

Grazed from Computer Technology Review. Author: Tom Leyden.

At the NAB 2012 show last month, the storage industry was more visible than ever, and the buzzword emerging from the event was ‘object storage.’ The term is inextricably tied to storage clouds and certain kinds of Big Data applications. And it’s becoming increasingly relevant for private clouds as well as for service providers.

The cloud industry has been on steroids lately – moving from the adoption of virtualization to more cloud-friendly architectures, with hypervisor support, control panels and Web services-based interfaces increasingly the norm. Industry analysts agree that cloud computing will continue to stimulate the astronomical growth in storage requirements…

Meanwhile, the trend toward Big Data-sized storage requirements has moved beyond Big Data in its original form, that is, Big Data Analytics on very large data sets of very small log files. Big Data now also includes several categories of Big Unstructured Data, typified by the movie industry with its high-res movie files and large archives of digital media content – or by scientific applications such as genomics or geographical research.

Traditional RAID Approaches Won’t Scale
To manage this fast-growing Big Data, companies need to shift away from traditional storage approaches, which often still means RAID. Big Data applications such as storage clouds require architectures that scale beyond hundreds of petabytes, the highest efficiency in the form of low-power, low-overhead infrastructures, high availability and durability into the ten 9s, and a straightforward interface such as representational state transfer (REST).

Object storage is the new storage paradigm that was specifically designed for unstructured Big Data applications. It’s the way Facebook, Google and Amazon have addressed the storage cloud scalability issue, and it fits many IT infrastructures today.

The cause-and-effect relationship between online – i.e. cloud – applications and data growth is not always crystal clear. Are online applications designed to help us manage our growing data sets, or do we simply store more data because it has become more convenient? The fact is that the amount of data we all store “in the cloud” is growing exponentially.

Less than five years ago, before the advent of online – cloud – applications, our data was mostly stored locally. To protect our data, we relied on various backup procedures, usually manually assisted. As datasets grow, these procedures are becoming less and less efficient.

Also, finding a specific file or document has become more difficult. File systems were designed to keep data organized, but the explosion of unstructured data makes them unwieldy when you are looking for a specific document: think of photo storage systems such as Picasa or office suites such as Google Docs.

This is where cloud computing and Big Data meet. Backup and recovery processes are increasingly being moved to the cloud. Document sharing is hot (Dropbox!) and archives are moved back off tape onto disk storage systems because data archives are so much more valuable when the data is actually accessible. Lastly, the Social-Local-Mobile hype stimulates us to generate data everywhere and to demand accessibility anywhere.

How Object Storage Works
To meet those requirements, companies such as Facebook, Amazon and Google designed scale-out storage systems of their own. Today, a number of object storage solutions are available on the market that can cope with data volumes like those of Facebook or Amazon. The backend consists of a scalable storage pool, typically built out of commodity storage nodes with very high density and low power consumption. At the front, a number of controller nodes provide performance. It’s best practice to be able to scale performance and capacity separately, as you don’t want to spend big bucks on high-end storage nodes whose processing power won’t be used.
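To make that capacity-versus-performance split concrete, here is a minimal sizing sketch in Python. All of the figures (target capacity, node density, per-controller throughput) are assumptions chosen for illustration, not vendor specifications.

```python
# Rough sizing sketch for a scale-out object store in which capacity and
# performance are provisioned independently. All figures are illustrative
# assumptions, not vendor specifications.

import math

TARGET_CAPACITY_TB = 5_000       # assumed target: 5 PB usable
TARGET_THROUGHPUT_GBPS = 40      # assumed aggregate throughput target

STORAGE_NODE_CAPACITY_TB = 200   # assumed dense, low-power storage node
CONTROLLER_THROUGHPUT_GBPS = 5   # assumed throughput per controller node

storage_nodes = math.ceil(TARGET_CAPACITY_TB / STORAGE_NODE_CAPACITY_TB)
controller_nodes = math.ceil(TARGET_THROUGHPUT_GBPS / CONTROLLER_THROUGHPUT_GBPS)

print(f"Storage nodes (capacity):       {storage_nodes}")
print(f"Controller nodes (performance): {controller_nodes}")

# Doubling the throughput target only adds controllers; the storage tier
# is untouched -- which is the point of scaling the two tiers separately.
```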

Access to the data comes through a REST interface, so applications can read and write data without a file system in between. Files – objects – are dumped into the pool, and an identifier is kept to locate each object when it is needed. Applications designed to run on top of object storage use these identifiers through REST calls. A good analogy is valet parking versus self-parking. When you self-park, you have to remember the lot, the floor, the aisle, etc. (the file system); with valet parking, you get a receipt when you hand over your keys, and you use that receipt later to retrieve your car.
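Here is a minimal sketch of those put/get semantics, assuming a toy in-memory store rather than a real platform: the class and method names are invented for illustration, and a production system would expose the same operations over an HTTP/REST interface (for example, S3-style PUT and GET requests) instead of local Python calls.

```python
import uuid

class ToyObjectStore:
    """Minimal in-memory sketch of object (identifier/value) semantics.

    Real object stores expose the same idea over a REST interface:
    PUT an object, get back (or supply) an identifier, and later GET
    the object by that identifier -- no directory hierarchy involved.
    """

    def __init__(self):
        self._pool = {}  # identifier -> raw bytes

    def put(self, data: bytes) -> str:
        # The returned identifier is the "valet receipt": it is all the
        # application needs to keep in order to retrieve the object later.
        object_id = str(uuid.uuid4())
        self._pool[object_id] = data
        return object_id

    def get(self, object_id: str) -> bytes:
        return self._pool[object_id]

store = ToyObjectStore()
receipt = store.put(b"raw camera footage, scene 42")
print(receipt)             # e.g. '3f2b9c4e-...' -- no path, no folder
print(store.get(receipt))  # hand the receipt back, get the object
```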

For data protection, RAID is out of the question for a number of reasons. First, petabyte-scale systems in which every storage node needs a high-end processor (for rebuild purposes) would not be cost-efficient. Second, RAID does not allow you to build a true single storage pool, and it requires a huge amount of overhead to provide acceptable availability – the more data we store, the more painful it is to need 200 percent overhead, as some systems do. Finally, as disks grow larger, rebuilds of failed disks in a RAID system take too long and leave the system less protected in the meantime.
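A back-of-the-envelope calculation shows why rebuild times have become a problem; the disk size and effective rebuild rate below are assumed figures, not measurements.

```python
# Back-of-the-envelope rebuild time for a single failed disk in a RAID set.
# Disk size and effective rebuild rate are assumed, illustrative figures.

disk_size_tb = 4              # assumed disk size
rebuild_rate_mb_per_s = 100   # assumed effective rebuild throughput

disk_size_mb = disk_size_tb * 1_000_000
rebuild_hours = disk_size_mb / rebuild_rate_mb_per_s / 3600
print(f"Rebuild time: ~{rebuild_hours:.0f} hours")   # ~11 hours

# During those hours the RAID set runs degraded; with RAID 5, a second disk
# failure in that window means data loss, which is why long rebuilds on
# large disks leave the system less protected.
```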

A more recent and much more reliable alternative that provides the highest level of protection (ten 9s and beyond) is erasure coding. Erasure coding stores objects as equations that are spread over the entire storage pool: data objects (files) are split up into sub-blocks, from which equations are calculated. In line with the availability policy, a surplus of equations is calculated, and the equations are spread over as many disks as possible – again as defined by policy.

As a result, when a disk fails, the system still has sufficient equations to restore the original data blocks. It can then recalculate equations as a background task to bring the number of available equations back to a healthy level.
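The sketch below illustrates the principle with the simplest possible "equation": a single XOR parity block that can tolerate one lost fragment. Production erasure codes compute several independent equations (Reed-Solomon style) and tolerate multiple simultaneous failures; this example is only meant to show how a lost sub-block can be recalculated from the surviving fragments.

```python
# Simplified illustration of the erasure-coding idea: split an object into
# sub-blocks, compute a redundant "equation", spread everything over disks,
# and reconstruct a lost block from what remains.

def split(data: bytes, k: int) -> list[bytes]:
    """Split data into k equally sized sub-blocks (zero-padded)."""
    block_len = -(-len(data) // k)          # ceiling division
    data = data.ljust(block_len * k, b"\0")
    return [data[i * block_len:(i + 1) * block_len] for i in range(k)]

def xor_blocks(blocks: list[bytes]) -> bytes:
    """The redundant 'equation': byte-wise XOR of all sub-blocks."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            out[i] ^= b
    return bytes(out)

obj = b"high-res movie segment 0001"
data_blocks = split(obj, k=4)        # 4 data sub-blocks
parity = xor_blocks(data_blocks)     # 1 redundant equation
# The 5 fragments would now be written to 5 different disks/nodes.

# Simulate losing one disk: drop block 2, then rebuild it from the rest.
lost = data_blocks[2]
survivors = data_blocks[:2] + data_blocks[3:] + [parity]
reconstructed = xor_blocks(survivors)
assert reconstructed == lost
print("lost block recovered:", reconstructed)
```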

Apart from providing a more efficient and more scalable way to store data, erasure coding-based object storage can save up to 70 percent on the overall TCO, thanks to reduced raw storage needs and reduced power needs: less hardware, and low-power devices that save on power and cooling. Also, uniformly scalable storage systems with an automated healing mechanism drastically reduce management effort and cost.
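To see where the raw-capacity (and therefore power) savings come from, compare the raw disk needed to keep 1 PB of usable data under triple replication with an assumed 16+4 erasure coding policy. The policy and the capacity figure are illustrative assumptions and are not the source of the 70 percent TCO figure above.

```python
# Raw storage needed for 1 PB of usable data under two protection schemes.
# The erasure-coding policy (16 data + 4 parity fragments) is an assumed,
# purely illustrative choice.

usable_pb = 1.0

replication_factor = 3
raw_replication = usable_pb * replication_factor   # 3.00 PB raw

k, m = 16, 4                                       # assumed policy
raw_erasure = usable_pb * (k + m) / k              # 1.25 PB raw

saving = 1 - raw_erasure / raw_replication
print(f"Triple replication: {raw_replication:.2f} PB raw")
print(f"Erasure coding:     {raw_erasure:.2f} PB raw")
print(f"Raw capacity saved: {saving:.0%}")         # ~58% less raw disk
```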

Checklist: Does Object Storage Fit My Needs?
Here’s a brief checklist of considerations to review if you’re considering object storage.

  1. How much storage do you currently have and what is your growth rate? Object storage was designed for large data volumes (Big Data) – typically one petabyte and above. Some technologies claim to be optimal for smaller infrastructures, but bear in mind that these are unlikely to support your growth.
  2. What availability do you require? Should that be the same for all users and all data? While most object storage platforms claim availability beyond ten 9s, you may want to select a platform that provides flexible availability policies. Not all data and not all users require the same availability levels.
  3. How many datacenters do you want to support? If you can spread your data over more than two datacenters, you may want to select an object storage platform that can geo-spread without replicating data, for lower storage overhead.
  4. How big is your storage team? Remember that in 10 years’ time, your team will likely have to support 20 times as much storage. Automation should be a key feature of the storage system of your choice.
  5. What are your power limitations? Power should be your number one concern when selecting a storage platform. Some systems feature power consumption as low as three watts per terabyte – see the sketch after this list for what that means at scale.
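As a rough way to put that last consideration in context, the sketch below converts the three-watts-per-terabyte figure into a power and energy budget. The 2 PB capacity is an assumed example, and cooling and networking overhead are ignored.

```python
# Quick power check for checklist item 5: at 3 W/TB, what does a given
# capacity draw? Capacity is an assumed example; cooling overhead ignored.

watts_per_tb = 3
capacity_tb = 2_000          # assumed: 2 PB of raw capacity

power_kw = watts_per_tb * capacity_tb / 1000
annual_kwh = power_kw * 24 * 365
print(f"Steady-state draw: {power_kw:.1f} kW")       # 6.0 kW
print(f"Annual energy:     {annual_kwh:,.0f} kWh")   # ~52,560 kWh
```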