Alluxio Delivers First Data Orchestration Platform Powering Multi-cloud Analytics and AI
July 11, 2019
Alluxio, developer of open source data orchestration technology used by seven of the world’s top 10 Internet companies, today announced at AWS Summit New York the availability of Alluxio 2.0 with breakthrough innovations for data engineers managing and deploying analytical and AI workloads in the cloud, particularly for hybrid and multi-cloud environments.
The rise of compute-intensive workloads and the adoption of the cloud have driven organizations to adopt a decoupled architecture for modern workloads, one in which compute scales independently from storage. While this enables elastic scaling, it introduces new data engineering problems, and this is where an abstraction layer is needed. Just as compute and containers need Kubernetes, data needs orchestration as data silos multiply: a tier that brings data locality, data accessibility and data elasticity to compute across data silos, zones, regions and even clouds.
“With a data orchestration platform in place, a data analyst or scientist can work under the assumption that the data will be readily accessible regardless of where the data resides or the characteristics of the storage. They can focus on building data-driven analytical and AI applications to create value, without worrying about the environment and vendor lock-in,” said Haoyuan Li, Founder and CTO, Alluxio. “These new advancements to Alluxio’s data orchestration platform further cement our commitment to a cloud-native, open source approach to enabling applications to be compute, storage and cloud agnostic.”
Alluxio 2.0 Community Edition and Enterprise Edition include new capabilities across critical areas where today’s cloud data engineering market has gaps:
Breakthrough Data Orchestration Innovation for Multi-cloud:
- Policy-driven Data Management
- Alluxio 2.0 includes a new capability that allows data engineers to automate data movement across storage systems on an ongoing basis, based on pre-defined policies. This means that as data is created and moves from hot to warm to cold, Alluxio can automate tiering of that data across any number of storage systems, on-premises and across all clouds.
- Data platform teams can now reduce storage costs by automatically keeping only the most important data in expensive storage systems and moving other data to cheaper storage alternatives.
- Improved Administration of Data Access Policies
- In addition to fine-grained policies at the file level, users can now configure policies at any directory or folder level to streamline data access and improve workload performance. These include defining behaviors for individual datasets on various core functions like writing data or syncing data with storage systems under Alluxio.
- Efficient Cross-cloud Data Movement via Data Service
- The new data service allows for highly efficient data movement including across cloud stores like AWS S3 and Google GCS, making expensive operations on object storage seamless to the compute framework.
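As a rough illustration of what policy-driven tiering means in practice, the sketch below classifies files as hot, warm or cold by how recently they were accessed and lists the moves a policy would request. It does not use Alluxio's API or configuration syntax; the tier names, the 7-day and 30-day thresholds, and the `FileInfo`, `choose_tier` and `plan_moves` names are all assumptions made for this example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical thresholds, chosen only for illustration.
HOT_WINDOW = timedelta(days=7)
WARM_WINDOW = timedelta(days=30)


@dataclass
class FileInfo:
    path: str
    last_access: datetime


def choose_tier(info: FileInfo, now: datetime) -> str:
    """Return the tier an age-based policy would assign to a file."""
    age = now - info.last_access
    if age <= HOT_WINDOW:
        return "hot"   # e.g. fast storage close to compute
    if age <= WARM_WINDOW:
        return "warm"  # e.g. a standard object store
    return "cold"      # e.g. a cheaper archival store


def plan_moves(files, current_tiers, now):
    """List (path, from_tier, to_tier) for files whose tier should change."""
    moves = []
    for info in files:
        target = choose_tier(info, now)
        current = current_tiers.get(info.path)
        if current != target:
            moves.append((info.path, current, target))
    return moves
```

Running `plan_moves` on a schedule against a catalog of files would produce the list of transfers needed to keep each tier in line with the policy; the actual movement would be carried out by the storage layer.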
Compute Optimized Data Access for Cloud Analytics:
- Compute-focused Cluster Partitioning
- Users can now partition a single Alluxio cluster along any dimension, so that the datasets for each framework or workload aren’t contaminated by the others. The most common usage is partitioning the cluster by framework, such as Spark or Presto. In addition, this reduces data transfer costs by constraining data to stay within a specific zone or region.
- Integration with External Data Sources Over REST
- Users can now bring in data from web-based data sources and aggregate it in Alluxio to perform their analytics. Alluxio can simply be pointed at any web location with files, and the data is pulled in as needed based on the query or model run.
Amazon AWS Support:
- AWS Elastic MapReduce (EMR) Service Integration
- As users move to cloud services to deploy analytical and AI workloads, services like AWS EMR are increasingly used. Alluxio can now be seamlessly bootstrapped into an AWS EMR cluster, making it available as a data layer within EMR for the Spark, Presto and Hive frameworks. Users now have a high-performance option for caching data from S3 or remote sources while also reducing the number of data copies maintained in EMR.
Architectural Foundations Using Open Source:
Many core foundational elements have been re-architected using the best open source technologies with a vision of hyperscale.
- RocksDB is now used for tiering the metadata of files and objects that Alluxio manages, to enable hyperscale.
- gRPC, Google’s highly efficient RPC framework, is now the core transport protocol used for communication within the cluster as well as between the Alluxio client and master, making communications more efficient.
“Data is only as useful as the insights derived from it and with organizations trying to analyze as much data as possible to gain a competitive edge, it’s challenging to find useful data that’s spread across globally-distributed silos. This data is being requested by various compute frameworks, as well as different types of users hoping to gain actionable insight,” said Mike Leone, Analyst, ESG. “These multiple layers of complexity are driving the need for a solution to improve on the process of making the most valuable data accessible to compute at the speed of innovation. Alluxio has identified an important missing piece that makes data more local and easily accessible to data-powered compute frameworks regardless of where the data resides or the characteristics of the underlying storage systems and clouds.”
“Whether by design or by departmental necessity, companies are facing an explosion of data that is spread across hybrid and multi-cloud environments. To maintain a competitive advantage, speed and depth of insight have become the requirement,” said Steven Mih, CEO, Alluxio. “Data-driven analytics that were once run over many hours, now need to be done in seconds. AI/ML models need to be trained against larger-and-larger datasets. This all points to the necessity of a data tier which orchestrates the movement and policy-driven access of a company’s data, wherever it may be stored. Alluxio abstracts the storage and enables a self-service culture within today’s data-driven company.”
Other Features Include:
- Highly Distributed Data Services – 2.0 introduces the Alluxio Data Service, a distributed clustered service that performs data operations such as replication and persistence, enabling high performance and massive scale.
- Adaptive Replication for Increased Data Locality – New feature to configure a range for the number of copies of data stored in Alluxio that are automatically managed.
- High Availability with Embedded Journal – A new fault tolerance and high availability mode for file and object metadata called the embedded journal, which uses the Raft consensus algorithm and is independent of any external storage system. This is particularly helpful for abstracting object storage.
- Alluxio POSIX API – Alluxio’s FUSE feature enables a POSIX-compatible API so that frameworks like TensorFlow and Caffe, as well as other Python-based models, can directly access data from any storage system via Alluxio using traditional file system access.
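To make the POSIX point concrete: once a file system is exposed through a FUSE mount, a Python program needs no special client library and can use ordinary file calls. The sketch below relies only on the Python standard library. The mount point `/mnt/alluxio-fuse` and the dataset paths in the usage comment are hypothetical, and the helper names `list_dataset` and `read_first_bytes` are introduced here purely for illustration.

```python
import os


def list_dataset(mount_point: str, relative_dir: str) -> list:
    """Return the sorted file names under a directory of the mounted file system."""
    return sorted(os.listdir(os.path.join(mount_point, relative_dir)))


def read_first_bytes(mount_point: str, relative_path: str, n: int = 64) -> bytes:
    """Read up to n bytes from a file using plain POSIX-style file access."""
    path = os.path.join(mount_point, relative_path)
    with open(path, "rb") as f:
        return f.read(n)


# Example usage, assuming a FUSE mount exists at this hypothetical location:
#   names = list_dataset("/mnt/alluxio-fuse", "training-data")
#   header = read_first_bytes("/mnt/alluxio-fuse", "training-data/part-0000")
```

Because the code only sees a directory tree, the same functions work unchanged against a local folder, which makes it easy to develop and test a model locally before pointing it at mounted remote storage.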
Availability:
Both the Community and Enterprise Editions of Alluxio 2.0 are now generally available for download via tarball, Docker, Homebrew and other channels.