Q&A: Jason Nadeau of Dremio Talks Data Lakes, Storage and Cloud
December 4, 2019

One thing has become crystal clear to me throughout 2019: the IT community continues to collect massive amounts of valuable data, but managing this volume of information remains a challenge. There is a great deal of data movement, along with copies, silos, hand-offs, points of failure, and so on. This has proven painful for data engineers and IT, and equally painful for data analysts, BI analysts, and data scientists.
To better understand what’s happening, CloudCow reached out to Jason Nadeau, VP of Marketing at Dremio, to learn more.
CloudCow: Why is the work involved in creating a data lake so complex and time- and resource-intensive?
Jason Nadeau: There are a few key challenges around creating a data lake. The first involves actually getting the data into data lake storage from the many existing underlying data sources that an organization already has. Traditionally, organizations would use ETL processes to put that data into a data warehouse, but an increasing number of organizations are using similar ETL processes to land the data directly into their data lake. However, there are often different tools and skill sets required to get data into data lakes.
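To make that concrete, here is a minimal ETL sketch in Python that lands data from an operational database into S3-based data lake storage as Parquet. The connection string, bucket, and table names are hypothetical, and real pipelines typically add orchestration, incremental loads, and error handling.

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import s3fs

# Extract: pull a table from an existing operational source
# (the connection string and table name are placeholders).
orders = pd.read_sql("SELECT * FROM orders", "postgresql://etl_user:secret@ops-db/prod")

# Transform: light cleanup and a partition key before landing the data.
orders["order_date"] = pd.to_datetime(orders["order_date"])
orders["order_month"] = orders["order_date"].dt.strftime("%Y-%m")

# Load: write columnar Parquet files into S3-based data lake storage.
pq.write_to_dataset(
    pa.Table.from_pandas(orders),
    root_path="example-data-lake/raw/orders",  # hypothetical bucket/prefix
    partition_cols=["order_month"],
    filesystem=s3fs.S3FileSystem(),
)
```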
The second, and arguably the biggest, challenge is that the processors and consumers of the data (for example, data engineers, data analysts, data scientists, and BI users) all connect to their data sources in a particular way. As the underlying data architecture changes to move and relocate data into data lake storage, all of those connections and data pipelines need to be updated and rebuilt.
The third challenge is around actually building and maintaining the data lake infrastructure itself. For many organizations, their data lake storage is built on-premises on top of Hadoop, which has proved too difficult to manage and scale, requiring sophisticated and highly skilled teams. At the same time, many organizations are looking to invest more in public cloud services and less in on-premises infrastructure, which can constrain resources and make success even harder.
CloudCow: Does moving data lakes into clouds such as AWS S3 and Microsoft Azure solve the complexity?
Nadeau: Only partly. Cloud data lake storage services such as AWS S3 and Microsoft ADLS help address the third challenge outlined above, in that they deliver storage capabilities as a service and thus eliminate the need to manage the underlying infrastructure. An increasing number of data sources are also already in the cloud, making them somewhat easier to connect. But cloud data lake storage still needs to be loaded with data, and data architects and engineers still have to deal with the cost, complexity, and risk of rebuilding and reconnecting data pipelines as data sources change.
CloudCow: Do some companies maintain a combination of cloud and on-premises data lake solutions? If so, why?
Nadeau: Yes, especially larger organizations and enterprises. These organizations have the skills required to effectively manage on-premises infrastructure. This gives them the flexibility to choose the right infrastructure, whether on-premises or from a specific cloud provider, based on their requirements today, knowing they can change and adapt in the future. For most, the benefit of this flexibility manifests primarily as lower costs and freedom from lock-in. And often, data governance issues are a driver for keeping data lakes on-premises, especially in certain regions and countries.
CloudCow: What are the challenges in moving data storage to the cloud, specifically with regard to migration and data management?
Nadeau: Moving and migrating the actual data from one place to another, for example, from on-premises to the cloud, is by itself quite straightforward. Many mature technologies exist for this purpose, such as synchronous or asynchronous replication, copied snapshots, backups and restores, and so on. But the real challenge is that users and applications are often built and configured with the assumption that the data location is fixed. Those users and applications are also constantly accessing that data.
As a result, changing the location of the data not only tends to introduce disruptive downtime but also typically involves significant and expensive changes up and down the entire application stack. To complicate matters further, new data sets are coming online all the time, and many usage patterns require simultaneous access to multiple data sets. The result is that large-scale data architecture and migration projects are typically expensive multi-year endeavors. Such projects are akin to rebuilding the engines on a jet aircraft while it is flying – a difficult proposition.
CloudCow: How can organizations make data that’s in a data lake queryable directly by analysts and data scientists?
Nadeau: Once data is loaded into the data lake storage, the biggest barrier to making that data directly queryable is performance. Data lake storage such as AWS S3 and Microsoft ADLS is optimized for cost, scale, and durability, but not latency, and low latency is required for any sort of interactive analysis. As a result, there are a number of workarounds that involve copies of data, along with significant IT effort, to achieve performance. BI extracts and OLAP cubes are by far the most common approach. But while they do increase performance and thus interactivity, they necessarily limit the data under analysis, and that limits the effectiveness of the analysis.
We believe the better approach is to start with a new architecture, a SQL engine that runs directly against data lake storage, and then build upon it to tackle the latency challenge head-on. Two examples of such SQL engines are Apache Presto (open source) and AWS Athena, which is built on Presto. But while Presto-based SQL engines simplify data access, they still do not provide the performance required to make the data lake usable by data consumers, especially at enterprise scale.
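As a rough illustration of this pattern, submitting a SQL query through a service such as AWS Athena, which reads Parquet files in S3 directly, might look like the sketch below; the database, table, bucket, and region names are hypothetical.

```python
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# SQL that runs directly against Parquet files in S3, with no warehouse load step.
query = """
    SELECT region, SUM(amount) AS total_sales
    FROM orders
    GROUP BY region
    ORDER BY total_sales DESC
"""

response = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "analytics_lake"},        # hypothetical database
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
print("Submitted query:", response["QueryExecutionId"])
```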
Here at Dremio, we’ve expanded upon the SQL engine concept to create a purpose-built data lake engine, which uses multiple highly-efficient query acceleration technologies to deliver the necessary interactive performance. For example, Dremio is the co-creator of Apache Arrow, Apache Gandiva, and Apache Arrow Flight, which are a set of columnar data processing and data exchange technologies. Our data lake engine is deeply architected around these technologies, and they deliver dramatic performance gains vs. traditional SQL engines.
Dremio has surrounded this columnar core with a number of other innovative technologies such as NVMe-based caching, a form of dynamic and transparent materialized views called Data Reflections, massively parallelized cloud data lake storage readers, and elastic cloud scale. Together, these technologies deliver 2-3X the performance of SQL engines such as Apache Presto at a similar cost. They can scale performance and cost linearly from there to meet the demands of data and data consumers at any scale. Dremio’s data lake engine finally delivers the interactive experience that data analysts, BI users, and data scientists have been waiting for.
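For a feel of the columnar model these technologies are built on, here is a small sketch using the open source pyarrow library. The file and column names are hypothetical, and this illustrates Arrow-style processing in general, not Dremio's internal implementation.

```python
import pyarrow.parquet as pq

# Read Parquet (columnar on disk) into an Arrow table (columnar in memory),
# touching only the columns the query actually needs.
table = pq.read_table("orders.parquet", columns=["region", "amount"])

# Vectorized aggregation over the columnar Arrow buffers.
totals = table.group_by("region").aggregate([("amount", "sum")])
print(totals.to_pandas())
```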
CloudCow: How can organizations get into the cloud and really take advantage of everything that a cloud data lake has to offer?
Nadeau: The easiest way for larger organizations and enterprises to start with a cloud data lake is to put new data there and start new analytics initiatives there. But the real value – and the real challenge – comes from migrating data from the existing on-premises solutions and aggregating that data into cloud data lake storage. Remember that data consumers are connecting into all of the existing data sources, and any changes to those sources as a result of migrations break those connections and require data engineers to rebuild them.
What enterprises thus need is a way to abstract the underlying migration to the cloud from the usage by the data consumers – the data analysts, BI users, and data scientists. This is exactly what Dremio provides, via our self-service semantic layer. Dremio implements the concept of virtual datasets, which map to physical datasets, and data consumers only connect to the virtual datasets. This provides the flexibility to make changes to underlying physical data sets without breaking existing pipelines or requiring rework. Armed with Dremio’s semantic layer, organizations can modernize their data infrastructure and move to the cloud at a pace that best suits them, while ensuring uninterrupted access to data for data consumers.
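As a rough sketch of what this looks like in practice, the snippet below defines a virtual dataset over a physical one through Dremio's SQL API. The host, credentials, and dataset paths are hypothetical placeholders, and the exact endpoints and syntax can vary by Dremio version.

```python
import requests

DREMIO = "http://dremio.example.com:9047"  # hypothetical Dremio coordinator

# Authenticate (username and password are placeholders).
token = requests.post(
    f"{DREMIO}/apiv2/login",
    json={"userName": "analyst", "password": "example-password"},
).json()["token"]
headers = {"Authorization": f"_dremio{token}"}

# Define a virtual dataset that data consumers query instead of the
# physical files; the underlying source can later move (for example,
# from HDFS to S3) without breaking this interface.
sql = """
    CREATE VDS analytics.sales.orders_clean AS
    SELECT order_id, region, amount
    FROM lake."raw".orders
    WHERE amount IS NOT NULL
"""
requests.post(f"{DREMIO}/api/v3/sql", json={"sql": sql}, headers=headers)
```

Because consumers connect to the virtual dataset rather than the physical files, the data engineering team can repoint the underlying source during a migration without touching downstream dashboards or notebooks.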
Finally, organizations can maximize their flexibility and keep their costs low by treating data lake storage as the source of record rather than ingesting the data into a proprietary data warehouse. The same is true whether a given proprietary data warehouse is cloud-native or delivered as an on-premises appliance. The important point is that proprietary data warehouses were created in a world before data lake storage and data lake engines. In this new world, most organizations can avoid the expense and lock-in implicit in proprietary data warehouses.
CloudCow: Can you explain the difference between multi-cloud and hybrid cloud data lake storage?
Nadeau: Our point of view is that multi-cloud refers to the use of more than one set of cloud data lake storage, where each set is used by a separate and distinct set of applications or workloads. In this way, an organization can spread its workloads across multiple cloud providers. This is true whether some of the data lake storage is on-premises (private cloud) or only in public clouds.
In contrast, a hybrid cloud refers to the use of more than one set of data lake storage in support of a single workload. In the hybrid cloud, the workload joins data from multiple data lake storage services. In this scenario, there could be multiple workloads, each joining data from more than one set of data lake storage. Once again, this is true whether some of the data lake storage is on-premises (private cloud) or only in public clouds.
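To illustrate the hybrid pattern, a single query in a Presto-style engine can join a table in on-premises data lake storage with one in cloud object storage, with each storage service exposed as its own catalog. The hosts, catalogs, and table names below are hypothetical.

```python
import prestodb  # presto-python-client; endpoints and catalogs are placeholders

# One workload joining data that lives in two different data lake storage
# services: an on-premises lake and a public cloud lake.
conn = prestodb.dbapi.connect(
    host="presto.example.com", port=8080, user="analyst",
    catalog="onprem_lake", schema="sales",
)
cur = conn.cursor()
cur.execute("""
    SELECT o.region, SUM(o.amount) AS revenue, COUNT(c.campaign_id) AS campaigns
    FROM onprem_lake.sales.orders AS o
    JOIN cloud_lake.marketing.campaigns AS c
      ON o.region = c.region
    GROUP BY o.region
""")
print(cur.fetchall())
```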
CloudCow: In 2020, do you think organizations with highly regulated data, such as healthcare organizations, will shift to the cloud?
Nadeau: Absolutely, and this shift is already underway. The cloud offers organizations in every industry (including highly regulated ones like healthcare) the agility, flexibility, and scalability to create and maintain a competitive advantage. What is really interesting to us is that the costs of running in the cloud are turning out to be a lot higher than many predicted. Now, enterprises are actively evaluating and selecting cloud services based on their ability to deliver cost savings through higher compute efficiency. Once again, this is an area where Dremio shines, as our data lake engine is extremely compute efficient.