Q&A: Jason Nadeau of Dremio Talks Dremio AWS, Data Migration Challenges, Cloud Data Lakes and More

May 6, 2020 | By David

This week, Dremio introduced a new offering, purpose-built for Amazon Web Services (AWS), with two new technologies to support on-demand data lake insights and reduce cloud infrastructure costs.

To find out the latest, CloudCow spoke with Jason Nadeau, VP of Marketing for Dremio.

CloudCow: Tell us about the new features in Dremio AWS Edition.

Jason Nadeau: Dremio AWS Edition introduces two new technologies to support data lake insights on demand and to reduce cloud infrastructure costs: parallel projects and elastic engines.

Parallel projects are multi-tenant Dremio instances, with end-to-end lifecycle automation across deployment, configuration with best practices, and upgrades, all running in customers’ own AWS accounts. This deep automation delivers a service-like experience where data engineers and data analysts can deploy an optimized Dremio instance from scratch, start querying their data in minutes, and effortlessly stay current with the latest Dremio features. Parallel projects thus bring a “best-of-both-worlds” approach, delivering the ease of SaaS with the full control of software running in customers’ VPC.

Elastic engines are independently sized query engines with on-demand, elastic scale and automated start and stop. With elastic engines, data teams can configure any number of query engines within a given project, each sized and tailored to the workload it supports. This eliminates both under- and over-provisioning of compute resources, maximizing concurrency and performance while minimizing the required compute infrastructure. Each engine also scales elastically on demand: when there is no query activity, an engine remains shut down and consumes no compute resources; incoming query demand triggers the engine to automatically start and elastically scale up to its full, tailored size; and when the queries stop, the engine automatically scales back down and stops. In this way elastic engines eliminate resource consumption and cost for idle workloads. The result is cost savings of more than 60% for typical mixed-workload environments versus single-execution-cluster approaches.
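The start-and-stop behavior described above can be sketched in a few lines. This is a minimal illustration of the pattern, not Dremio's actual implementation; the class and method names (`ElasticEngine`, `submit`, `tick`) and the idle-timeout value are hypothetical.

```python
# Sketch of an elastic engine's lifecycle: stopped (zero cost) while idle,
# started and scaled to its tailored size when queries arrive.
# All names here are illustrative assumptions, not Dremio's API.

class ElasticEngine:
    def __init__(self, name, max_nodes, idle_timeout_s=300):
        self.name = name
        self.max_nodes = max_nodes          # size tailored to this workload
        self.idle_timeout_s = idle_timeout_s
        self.running_nodes = 0              # 0 => stopped, no compute cost
        self.last_activity = None

    def submit(self, query, now):
        # Incoming demand starts the engine and scales it to full size.
        if self.running_nodes == 0:
            self.running_nodes = self.max_nodes
        self.last_activity = now
        return f"{self.name} running {query!r} on {self.running_nodes} nodes"

    def tick(self, now):
        # With no activity past the idle timeout, scale back down and stop.
        if (self.running_nodes > 0 and self.last_activity is not None
                and now - self.last_activity >= self.idle_timeout_s):
            self.running_nodes = 0

engine = ElasticEngine("bi-dashboards", max_nodes=8, idle_timeout_s=300)
engine.submit("SELECT ...", now=0)   # engine starts on demand
engine.tick(now=600)                 # idle past timeout: engine stops again
```

Because each workload gets its own engine object with its own size, one noisy workload never forces over-provisioning of another.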

CloudCow: What are some of the challenges data teams have when migrating to the cloud?

Jason Nadeau: Data migration is clearly a challenge, especially given the volumes of data in some on-premises data lakes. But another big challenge is migrating the semantic layer so that KPIs and business logic remain consistent. Many organizations have built that layer into their legacy BI tools, but as part of the shift to the cloud, many are also switching to more modern visualization tools like Tableau and Power BI. Dremio centralizes the semantic layer and ensures it is consistent, regardless of which visualization solution is used. This also dramatically simplifies data management, and is why many of our customers deploy us both on-premises and in the cloud.

CloudCow: When building a data lake on the cloud, is performance a deterrent?  

Jason Nadeau: Yes, it is for OLAP and queries directly against data in data lake storage, and that is one of the key challenges Dremio directly addresses. This lack of performance has led many organizations to copy and move portions of their data from cloud data lake storage into proprietary data warehouses as a way to try to work around the performance issues. The cloud in general, and cloud object storage in particular, require applications to be architected and built differently. Applications need to be highly parallelized and scale-out, for example, and for OLAP they need to be optimized for columnar data, both in memory and in persistent storage. But even then performance may not be enough for some dashboarding and reporting use cases, and this is where additional acceleration technologies are required. At Dremio, we believe those should be built into the query engine itself and made transparent to data consumers using BI and data science tools, since this dramatically simplifies the environment and shortens the time to analytics.

CloudCow: How has Dremio’s data lake platform evolved to support on-demand analytics?

Jason Nadeau: We’ve been innovating since we launched the company to deliver lightning fast query performance optimized for cloud data lake storage, as well as to deliver a centralized, consistent semantic layer that provides a self-service experience for data consumers.

For performance, we’ve architected our platform to be end-to-end columnar using Apache Arrow, an extremely popular columnar in-memory format we co-created and open sourced in the early days of the company. You can think of Arrow as the in-memory counterpart to popular on-disk formats like Apache Parquet, and increasingly as the standard used by many different systems. And over time we’ve enhanced our platform with a rich set of performance technologies such as an LLVM-based execution kernel (based on Apache Gandiva); massively parallel object store readers with predictive pipelining; transparent caching into fast NVMe cloud storage; and a form of materialized views we call data reflections, which are physically optimized representations of source data, persisted as Parquet. The Dremio query optimizer can accelerate a query by utilizing one or more reflections to partially or entirely satisfy that query, rather than processing the raw data in the underlying data source. This all happens behind the scenes inside Dremio, transparently to data consumers connecting to their virtual data sets within Dremio. And with the AWS Edition we go even further with multiple elastic engines, which eliminate resource contention, ensuring consistent peak performance for every query workload.
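The reflection substitution idea above can be illustrated with a simple planner sketch: if a persisted, pre-computed representation covers everything a query needs, read it instead of the raw source. This is a hedged toy model, not Dremio's optimizer; the dictionary shape, paths, and the narrowest-covering heuristic are assumptions for illustration.

```python
# Toy sketch of reflection-based acceleration: pick a materialized
# representation (persisted as Parquet) that covers the query's columns,
# falling back to a raw scan when none does. Not Dremio internals.

def plan_scan(query_columns, reflections, raw_source):
    """Return the scan target: a covering reflection if one exists, else raw."""
    candidates = [r for r in reflections
                  if set(query_columns) <= set(r["columns"])]
    if candidates:
        # Prefer the narrowest covering reflection (fewest columns to read).
        best = min(candidates, key=lambda r: len(r["columns"]))
        return ("reflection", best["path"])
    return ("raw", raw_source)

reflections = [
    {"path": "s3://lake/reflections/orders_by_day.parquet",
     "columns": ["order_date", "region", "revenue"]},
    {"path": "s3://lake/reflections/orders_full.parquet",
     "columns": ["order_id", "order_date", "region", "revenue", "customer_id"]},
]

plan = plan_scan(["order_date", "revenue"], reflections,
                 raw_source="s3://lake/raw/orders/")
# → ("reflection", "s3://lake/reflections/orders_by_day.parquet")
```

The point of the sketch is the transparency: the caller never names a reflection, so accelerations can be added or dropped without changing any query.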

And for our semantic layer, we’re constantly improving things over time to make it really easy for data engineers to publish and govern virtual data sets for their diverse range of BI and data science end users. These virtual data sets enable completely virtual transformations of data, including joins and other operations, so data teams can create an unlimited number of views into their data, all without copies and their associated cost, complexity, and governance risk. End users can then find the virtual data sets they need and connect their tool of choice. And if those end users need additional performance for dashboarding and reporting-style queries, data engineers can easily and transparently configure reflections to speed things up by 10-100X, all without any changes from the perspective of the end user. All this shrinks the time to analytics from weeks or even months to minutes.
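The "views without copies" idea behind virtual data sets can be sketched as a catalog that stores only SQL definitions. This is a minimal illustration under assumed names (`SemanticLayer`, `create_virtual_dataset`, `resolve`) and illustrative SQL; it is not Dremio's actual catalog or syntax.

```python
# Sketch of a semantic layer: a virtual data set is stored SQL, so any
# number of tailored views exist without duplicating the underlying data.
# Class, method, and dataset names are illustrative assumptions.

class SemanticLayer:
    def __init__(self):
        self.views = {}   # name -> SQL definition (no data materialized)

    def create_virtual_dataset(self, name, sql):
        # Only the definition is stored; no copy of the data is made.
        self.views[name] = sql

    def resolve(self, name):
        # A BI tool querying the view is transparently rewritten to source SQL.
        return self.views[name]

layer = SemanticLayer()
layer.create_virtual_dataset(
    "sales.kpi_revenue_by_region",
    "SELECT region, SUM(revenue) AS revenue "
    "FROM lake.raw.orders GROUP BY region",
)
print(layer.resolve("sales.kpi_revenue_by_region"))
```

Because every consumer resolves the same definition, a KPI changed once in the layer changes consistently for Tableau, Power BI, and notebook users alike.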

CloudCow: What are the benefits of Dremio AWS Edition over traditional software deployments?

Jason Nadeau: We’ve built Dremio AWS Edition to give customers the benefits of a service-like experience, with full control in their own AWS account and VPC. It’s the best of both worlds. To do this we use deep automation across the lifecycle. For example, we have automation that makes it fast and easy to deploy Dremio and optimally configure it for AWS. Users can be up and running and querying data from scratch in 10 minutes. We also have automation that makes it effortless for our users to stay current with our latest version and take advantage of new features and capabilities. Users simply need to stop, then restart their instance – that’s it. Dremio automatically upgrades to the latest software version behind the scenes. And we have automation that ensures customers use their Dremio query engine resources extremely efficiently, starting and stopping them dynamically based on query activity.
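The stop-then-restart upgrade pattern can be sketched as a version check on boot. This is an illustrative model of the pattern only; the function names and version strings are hypothetical, and Dremio's real mechanism is not described here.

```python
# Sketch of upgrade-on-restart: at startup, compare the installed version
# to the latest published release and upgrade before serving queries.
# Names and versions are illustrative assumptions, not Dremio's mechanism.

def parse_version(v):
    """Turn '4.3.1' into (4, 3, 1) for ordered comparison."""
    return tuple(int(part) for part in v.split("."))

def on_instance_start(installed, latest_published):
    """Return the version the instance runs after startup."""
    if parse_version(latest_published) > parse_version(installed):
        # Hypothetical upgrade step; a real system would fetch and swap binaries.
        installed = latest_published
    return installed

running = on_instance_start(installed="4.2.0", latest_published="4.3.1")
# → "4.3.1"
```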