MLCommons Association Unveils Open Datasets and Tools to Drive Democratization of Machine Learning

By AIT News Desk On Dec 22, 2021

The MLCommons Association, an open engineering consortium dedicated to improving machine learning for everyone, announced the general availability of the People’s Speech Dataset and the Multilingual Spoken Words Corpus (MSWC). These trailblazing, permissively licensed datasets advance innovation in machine learning research and commercial applications. Also today, the MLCommons Association is issuing a call for participation in the new DataPerf benchmark suite, which measures and encourages innovation in data-centric AI.

The People’s Speech Dataset

The People’s Speech Dataset is among the world’s largest English speech recognition datasets licensed for academic and commercial usage. The 30,000-hour supervised conversational dataset is an order of magnitude larger than what was available just a few years ago. The dataset, released under a Creative Commons license, democratizes access to speech technology such as voice assistants and transcription, and unlocks innovation in the machine learning community. Contributors to the dataset include researchers from Baidu, Factored, Harvard University, Intel, Landing AI, and NVIDIA. It can be downloaded at mlcommons.org/speech.

Multilingual Spoken Words Corpus

Also available today is the Multilingual Spoken Words Corpus (MSWC), a rich audio speech dataset covering more than 340,000 keywords in 50 languages, with upwards of 23.4 million examples in total. Previous datasets relied on manual efforts to collect and validate thousands of utterances for each keyword and were commonly restricted to a single language. A diverse multilingual dataset that spans languages spoken by over five billion people, MSWC advances the research and development of applications such as voice interfaces for a broad global audience. Contributors to the MSWC include researchers from Coqui, Factored, Google, Harvard University, Intel, Landing AI, NVIDIA, and the University of Michigan. It can be downloaded at mlcommons.org/words.

DataPerf

The new DataPerf benchmark suite supports data-centric AI innovation by measuring the quality of datasets for common ML tasks and the impact of enhancing datasets. Training and test datasets are a key part of creating an ML solution, since the solution can only be as good as the data, but much less effort is spent on understanding and improving datasets than on mastering and improving models. DataPerf fosters and measures progress in this vital area. The MLCommons Association will support a series of challenges with leaderboards in 2022 to encourage participation in DataPerf. Contributors to the suite include researchers from Alibaba, Coactive.AI, ETH Zurich, Google, Harvard University, Landing AI, Meta, Stanford University, and TU Eindhoven, drawing on the teams responsible for Cats4ML, the Data-Centric AI Competition, DCBench, Dynabench, and the MLPerf™ benchmarks. The MLCommons Association invites other participants to join the DataPerf effort at dataperf.ai.

Historically, most AI research has focused on improving model architectures and making them available to the community; in contrast, attention to engineering and maintaining datasets has lagged, and the work is often manual and ad hoc. The MLCommons Association is a firm proponent of Data-Centric AI (DCAI), the discipline of systematically engineering the data for AI systems by developing efficient software tools and engineering practices that make dataset creation and curation easier. Our open datasets and tools like DataPerf concretely support the DCAI movement and drive machine learning innovation.

“The machine learning model architecture for many applications is basically a solved problem. In many cases, focusing on engineering the data is more important for unlocking successful AI applications. Data is food for AI, and our systems need not just massive amounts of calories, but also high-quality nutrition. We need not just big data, but good data,” said Andrew Ng, founder and CEO of Landing AI, founding lead of Google Brain, co-founder and chairman of Coursera, and adjunct professor at Stanford University. “Thanks to the shared efforts by the community, including the work initiated by the MLCommons Association and its members, the movement demonstrates the potential for Data-Centric AI, and how we can collectively implement a greater AI adoption.”

“Speech technology can empower billions of people across the planet, but there’s a real need for large, open, and diverse datasets to catalyze innovation,” said David Kanter, the MLCommons Association co-founder and executive director. “The People’s Speech is a large-scale dataset in English while MSWC offers a tremendous breadth of languages. I’m excited for these datasets to improve everyday experiences like voice-enabled consumer devices and speech recognition.”

To repurpose or use any of the content or material on this and our sister sites, explicit written permission needs to be sought.

Copyright © 2025 AiThority. All Rights Reserved. Privacy Policy
