The Apache Software Foundation Announces Apache Hudi as a Top-Level Project
Open Source data lake technology for stream processing on top of Apache Hadoop in use at Alibaba, Tencent, Uber, and more
The Apache Software Foundation, the all-volunteer developers, stewards, and incubators of more than 350 Open Source projects and initiatives, announced today Apache Hudi as a Top-Level Project (TLP).
Apache Hudi (Hadoop Upserts Deletes and Incrementals) data lake technology enables stream processing on top of Apache Hadoop compatible cloud stores & distributed file systems. The project was originally developed at Uber in 2016 (code-named and pronounced “Hoodie”), open-sourced in 2017, and submitted to the Apache Incubator in January 2019.
“Learning and growing the Apache way in the incubator was a rewarding experience,” said Vinoth Chandar, Vice President of Apache Hudi. “As a community, we are humbled by how far we have advanced the project together, while at the same time, excited about the challenges ahead.”
Apache Hudi is used to manage petabyte-scale data lakes using stream processing primitives like upserts and incremental change streams on the Apache Hadoop Distributed File System (HDFS) or cloud stores. Hudi data lakes provide fresh data while being an order of magnitude more efficient than traditional batch processing. Features include the following (a brief usage sketch appears after the list):
- Upsert/Delete support with fast, pluggable indexing
- Transactionally commit/rollback data
- Change capture from Hudi tables for stream processing
- Support for Apache Hive, Apache Spark, Apache Impala and Presto query engines
- Built-in data ingestion tool supporting Apache Kafka, Apache Sqoop and other common data sources
- Optimized query performance by managing file sizes and storage layout
- Fast row-based ingestion format with asynchronous compaction into columnar format
- Timeline metadata for audit tracking
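As a rough illustration of the upsert and incremental change stream primitives listed above, the sketch below writes a DataFrame into a Hudi table with Apache Spark and then reads back only the records committed after a given instant on the timeline. It is a minimal sketch, not the project's reference example: the table path, column names (uuid, ts, partitionpath), and the begin instant are hypothetical, and it assumes the Hudi Spark bundle is on the classpath.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

object HudiUpsertSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hudi-upsert-sketch")
      // Hudi's Spark integration expects the Kryo serializer
      .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .getOrCreate()

    val tablePath = "/tmp/hudi/trips" // hypothetical table location

    // Hypothetical batch of trip records with columns: uuid, rider, fare, ts, partitionpath
    val updates = spark.read.parquet("/tmp/incoming_trips")

    // Upsert: records with an existing `uuid` are updated, new ones are inserted
    updates.write.format("hudi")
      .option("hoodie.table.name", "trips")
      .option("hoodie.datasource.write.recordkey.field", "uuid")
      .option("hoodie.datasource.write.partitionpath.field", "partitionpath")
      .option("hoodie.datasource.write.precombine.field", "ts")
      .option("hoodie.datasource.write.operation", "upsert")
      .mode(SaveMode.Append)
      .save(tablePath)

    // Incremental query: pull only the records committed after the given instant
    val changes = spark.read.format("hudi")
      .option("hoodie.datasource.query.type", "incremental")
      .option("hoodie.datasource.read.begin.instanttime", "20200601000000") // hypothetical instant
      .load(tablePath)
    changes.show()
  }
}
```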
Apache Hudi is in use at organizations such as Alibaba Group, EMIS Health, Linknovate, Tathastu.AI, Tencent, and Uber, and is supported as part of Amazon EMR by Amazon Web Services.
“We are very pleased to see Apache Hudi graduate to an Apache Top-Level Project. Apache Hudi is supported in Amazon EMR release 5.28 and higher, and enables customers with data in Amazon S3 data lakes to perform record-level inserts, updates, and deletes for privacy regulations, change data capture (CDC), and simplified data pipeline development,” said Rahul Pathak, General Manager, Analytics, AWS. “We look forward to working with our customers and the Apache Hudi community to help advance the project.”
“At Uber, Hudi powers one of the largest transactional data lakes on the planet in near real time to provide meaningful experiences to users worldwide,” said Nishith Agarwal, member of the Apache Hudi Project Management Committee. “With over 150 petabytes of data and more than 500 billion records ingested per day, Uber’s use cases range from business critical workflows to analytics and machine learning.”
“Using Apache Hudi, end-users can handle either read-heavy or write-heavy use cases, and Hudi will manage the underlying data stored on HDFS/COS/CHDFS using Apache Parquet and Apache Avro,” said Felix Zheng, Lead of Cloud Real-Time Computing Service Technology at Tencent.
“As cloud infrastructure becomes more sophisticated, data analysis and computing solutions gradually begin to build data lake platforms based on cloud object storage and computing resources,” said Li Wei, Technical Lead on Data Lake Analytics, at Alibaba Cloud. “Apache Hudi is a very good incremental storage engine that helps users manage the data in the data lake in an open way and accelerate users’ computing and analysis.”
“Apache Hudi is a key building block for the Hopsworks Feature Store, providing versioned features, incremental and atomic updates to features, and indexed time-travel queries for features,” said Jim Dowling, CEO/Co-Founder at Logical Clocks. “The graduation of Hudi to a top-level Apache project is also the graduation of the open-source data lake from its earlier data swamp incarnation to a modern ACID-enabled, enterprise-ready data platform.”
“Hudi’s graduation to a top-level Apache project is a result of the efforts of many dedicated contributors in the Hudi community,” said Jennifer Anderson, Senior Director of Platform Engineering at Uber. “Hudi is critical to the performance and scalability of Uber’s big data infrastructure. We’re excited to see it gain traction and achieve this major milestone.”
“Thus far, Hudi has started a meaningful discussion in the industry about the wide gaps between data warehouses and data lakes. We have also taken strides to bridge some of them, with the help of the Apache community,” added Chandar. “But, we are only getting started with our deeply technical roadmap. We certainly look forward to a lot more contributions and collaborations from the community to get there. Everyone’s invited!”