Q&A: Tomer Shiran of Dremio On Data Lake Engines, Cloud, Critical Data Challenges and More
September 17, 2019

As the IT community continues to collect massive volumes of valuable data, managing this ever-growing stream of information remains a complex and delicate process.
Today’s analytical requirements are putting unprecedented pressure on existing data infrastructures. Performing real-time analytics across operational and stored data is critical to success, but it continues to be a challenge. Tomer Shiran, co-founder and CEO of Dremio, says the data lake provides a layer that presents data in a more business-friendly way, one that has been validated by IT. Learn more from our discussion.
CloudCow: Describe what the Dremio Data Lake Engine is and how it's different from what you offered customers previously.
Tomer Shiran: A lot of companies have data lake storage, like ADLS or S3, but they can’t get value out of the data stored in it without doing a lot of work to make queries performant – ETL, data warehouses, data marts, things like that.
Dremio’s Data Lake Engine is about delivering really fast query speed directly on data lake storage. Dremio has always made queries very, very fast, but with this new release we’ve engineered a number of features that take that performance benefit even further and are specifically optimized for a cloud data lake environment – things like our Columnar Cloud Cache and Predictive Pipelining. In addition, we’re adding a number of new features for deeper integration with AWS and Azure, particularly around the security features that are native to those platforms.
CloudCow: What are some of the key features in the Data Lake Engines for AWS & Azure?
Shiran: I’d say the first key feature is enhanced performance, and that includes a few specific technologies that are new to this release: Columnar Cloud Cache (C3), Predictive Pipelining, and the general availability of Gandiva, a new kernel for Apache Arrow. Columnar Cloud Cache gives you NVMe performance on data lake storage by caching data on storage attached to the nodes you’re already running Dremio on. While the Columnar Cloud Cache helps with subsequent reads, Predictive Pipelining speeds up the first read by intelligently predicting read patterns and streaming in data before it’s actually needed by our query engine. And Gandiva is a new execution kernel for Arrow that speeds up execution by up to 80x in some cases, in addition to the 5x – 10x improvements from these other technologies.
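To make the caching idea concrete, here is a minimal, illustrative sketch in Python of what keeping hot byte ranges from object storage on local NVMe can look like. It is not Dremio's C3 implementation; the bucket, key, and cache directory are hypothetical, and the byte-range reads simply mirror how columnar formats such as Parquet are typically accessed.

```python
# Toy sketch of the idea behind a columnar cloud cache: keep byte ranges of
# columnar files from S3 on fast local disk (e.g. NVMe) so repeated reads
# skip the network round-trip. Illustrative only -- not Dremio's C3 code.
# The bucket, key, and cache directory are hypothetical.
import hashlib
import os

import boto3

CACHE_DIR = "/mnt/nvme/cache"          # assumed local NVMe mount
s3 = boto3.client("s3")

def read_range(bucket: str, key: str, start: int, end: int) -> bytes:
    """Return bytes [start, end] of an S3 object, caching them locally."""
    cache_key = hashlib.sha256(f"{bucket}/{key}:{start}-{end}".encode()).hexdigest()
    cache_path = os.path.join(CACHE_DIR, cache_key)

    if os.path.exists(cache_path):      # warm read: served from local NVMe
        with open(cache_path, "rb") as f:
            return f.read()

    # Cold read: fetch only the needed byte range from object storage.
    resp = s3.get_object(Bucket=bucket, Key=key, Range=f"bytes={start}-{end}")
    data = resp["Body"].read()

    os.makedirs(CACHE_DIR, exist_ok=True)
    with open(cache_path, "wb") as f:   # populate the cache for next time
        f.write(data)
    return data
```

The point of the sketch is simply that the second read of the same range never touches the object store, which is the property Shiran describes for subsequent reads.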
The second key feature is advanced security for AWS and Azure. For example, Dremio now supports encryption of S3 objects, and it can store the credentials for the data sources it connects to in AWS Secrets Manager. On the Azure side, we support SSO with Azure Active Directory. These features make secure integration with these data lake technologies much simpler.
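The security pieces Shiran mentions build on capabilities native to AWS. As a rough illustration of those primitives (not of Dremio's own code), the sketch below reads a data-source credential from AWS Secrets Manager and writes an S3 object with server-side encryption using boto3; the secret name, bucket, and key are hypothetical.

```python
# Illustrative use of the AWS primitives mentioned above (not Dremio code):
# reading a data-source credential from AWS Secrets Manager and writing an
# S3 object with server-side encryption. Names below are hypothetical.
import json

import boto3

secrets = boto3.client("secretsmanager", region_name="us-east-1")
s3 = boto3.client("s3", region_name="us-east-1")

# Fetch database credentials from Secrets Manager instead of keeping them
# in a local config file.
secret = secrets.get_secret_value(SecretId="prod/analytics/postgres")
creds = json.loads(secret["SecretString"])   # e.g. {"username": ..., "password": ...}

# Write an object with server-side encryption so it is encrypted at rest.
s3.put_object(
    Bucket="example-data-lake",
    Key="sales/2019/09/orders.parquet",
    Body=b"...parquet bytes...",
    ServerSideEncryption="aws:kms",          # or "AES256" for S3-managed keys
)
```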
There are a few other features as well. The last one I’ll touch on is Dremio Hub, which is a marketplace for people to download new connectors for Dremio. Not only does Dremio support connecting to data lake storage, it also brings in data from many other sources – databases, for example – and Dremio Hub is our way of dramatically expanding that capability.
CloudCow: What are some of the critical data challenges companies moving to the cloud are facing?
Shiran: We talk a lot about a tradeoff that many of our customers feel they have to make between performance on one side and complexity and cost on the other. To get performance from cloud data sources, people think there needs to be ETL, cubes, extracts, that sort of thing. The result is very complex data pipelines and analytics architectures. On top of that, even after a migration to the cloud, data is often still in other places – various databases, other clouds, and so on. So it’s not a simple process, and despite the tremendous advantages of the cloud, you can still end up with a set of data pipelines that are extremely time-consuming and costly. And then of course there’s the issue of performing the migration in the first place in a way that minimizes disruption for data scientists and analysts.
CloudCow: How does Dremio’s Data Lake Engine solve these problems?
Shiran: We make it so that queries can run directly on data lake storage, even though that storage is high latency. That allows the data architecture to be much simpler and gives data architects more flexibility. Dremio also makes it easy to join between data lakes and other databases, and even across clouds, which means analysts and data scientists have a single source for all the data they need. Lastly, because Dremio offers a self-service semantic layer, the complexity of the underlying data and where it’s stored is hidden from business users, which makes it far easier to carry out a migration with minimal downtime.
CloudCow: In the past, companies have utilized specific data processing tools, like Hadoop. What tools does Dremio use?
Shiran: Dremio’s built on a number of technologies that we’d be happy to talk about. I think the most interesting one here is Apache Arrow, which we co-created and which has quickly become the standard for in-memory columnar analytics. I think we’re up to about four million downloads per month on that one.
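For readers who haven't used it, the short example below gives a taste of Apache Arrow's in-memory columnar model via the pyarrow library; the table and column names are made up for illustration.

```python
# A small taste of Apache Arrow's in-memory columnar model using pyarrow.
# The column names and values are made up for illustration.
import pyarrow as pa
import pyarrow.compute as pc

# Build an in-memory columnar table: each column is a contiguous, typed
# buffer, which is what makes vectorized execution fast.
orders = pa.table({
    "region": ["us-east", "us-west", "us-east", "eu-west"],
    "quantity": [3, 5, 2, 7],
    "unit_price": [9.99, 4.50, 19.00, 2.25],
})

# Vectorized, column-at-a-time computation over the Arrow buffers.
revenue = pc.multiply(orders["quantity"], orders["unit_price"])
print("total revenue:", pc.sum(revenue).as_py())
```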
CloudCow: Is this product targeted at cloud and on-premises customers or just cloud?
Shiran: We have a version that works for hybrid cloud and on-prem customers, in addition to our Data Lake Engines for AWS and Azure.
CloudCow: How does Dremio work with BI tools like Tableau and Power BI?
Shiran: It’s really easy to connect these tools to Dremio. Even though Dremio can connect to lots of different sources, it exposes everything as a single, high-performance SQL database that can be connected to via ODBC, JDBC, or REST. In Power BI, Dremio has a Certified Connector built right in.
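As a minimal sketch of what connecting from code over ODBC can look like, the example below uses Python's pyodbc against a hypothetical DSN named "Dremio"; the credentials, source names, and tables are assumptions, not a prescribed setup.

```python
# Minimal sketch of querying Dremio over ODBC from Python, assuming a
# Dremio ODBC driver is installed and a DSN named "Dremio" is configured.
# The credentials, source names, and tables below are hypothetical.
import pyodbc

conn = pyodbc.connect("DSN=Dremio;UID=analyst;PWD=secret", autocommit=True)
cursor = conn.cursor()

# A single SQL statement can join data lake storage with a relational
# source, since everything is exposed under one SQL interface.
cursor.execute("""
    SELECT c.customer_name, SUM(o.amount) AS total_spend
    FROM   s3_lake.sales.orders AS o
    JOIN   postgres.crm.customers AS c ON o.customer_id = c.id
    GROUP BY c.customer_name
    ORDER BY total_spend DESC
    LIMIT 10
""")

for row in cursor.fetchall():
    print(row[0], row[1])   # customer_name, total_spend

conn.close()
```

Tableau and Power BI connect the same way, through the ODBC/JDBC drivers or, in Power BI's case, the built-in certified connector Shiran mentions.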
##
Tomer Shiran, Co-Founder and CEO
Tomer is co-founder and CEO of Dremio. Previously he was the VP of Product at MapR, where he was responsible for product strategy, roadmap, and new feature development. As a member of the executive team, he helped grow the company from 5 employees to over 300 employees and 1,000 enterprise customers. Prior to MapR, Tomer held numerous product management and engineering positions at Microsoft and IBM Research. Tomer holds an MS in Electrical and Computer Engineering from Carnegie Mellon University and a BS in Computer Science from the Technion – Israel Institute of Technology. He also holds five US patents.