Loquacious Set: 25,000 Hours of Transcribed and Diverse English Speech Recognition Data for Research and Commercial Use

Titouan Parcollet, Yuan Tseng, Shucong Zhang, Rogier van Dalen

TL;DR

LibriSpeech-inspired benchmarks are increasingly inadequate for real-world ASR research due to limited size and lack of noisy, spontaneous speech. The Loquacious Set delivers 25,000 hours of normalised, commercially usable English speech drawn from six sources, with multiple training subsets and robust evaluation splits, accompanied by reproducible SpeechBrain conformer recipes. Key contributions include comprehensive data normalisation, metadata handling, and a scalable pipeline for assembling diverse, license-compliant corpora, plus empirical evidence showing consistent performance gains with more data and larger models. This resource enables fair cross-dataset benchmarking and practical ASR development for both academia and industry in real-world conditions.

Abstract

Automatic speech recognition (ASR) research is driven by the availability of datasets shared between industrial researchers and academics, which encourages comparison and evaluation. LibriSpeech, despite its long success as an ASR benchmark, is now limited by its size and its focus on clean, read speech, leading to near-zero word error rates. More recent datasets, including MOSEL, YODAS, GigaSpeech, OWSM, Libriheavy, and People's Speech, suffer from major limitations, including licenses that industry researchers cannot use, unreliable transcriptions, incorrect audio data, or the lack of evaluation sets. This work presents the Loquacious Set, a 25,000-hour curated collection of commercially usable English speech. Featuring hundreds of thousands of speakers with diverse accents and a wide range of speech types (read, spontaneous, talks, clean, noisy), the Loquacious Set is designed to let academics and industry researchers build ASR systems for real-world scenarios.

Paper Structure

This paper contains 9 sections, 1 figure, 2 tables.

Figures (1)

  • Figure 1: Durations of the 9.6M samples of the large training split of the Loquacious Set. The average is 9.6 seconds.
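
As a quick sanity check, the two rounded numbers in the Figure 1 caption can be multiplied to see what total duration they imply. The sketch below is illustrative only: the helper name `total_hours` is hypothetical, and the inputs are the rounded caption values, so the result is an approximation. It comes out at about 25,600 hours, which is in line with the 25,000 hours quoted for the full set.

```python
def total_hours(num_samples: float, mean_seconds: float) -> float:
    """Total audio duration in hours implied by a sample count and a mean duration."""
    return num_samples * mean_seconds / 3600.0


# Rounded values taken from the Figure 1 caption (assumed, not exact counts).
NUM_SAMPLES = 9.6e6   # samples in the large training split
MEAN_SECONDS = 9.6    # average sample duration in seconds

hours = total_hours(NUM_SAMPLES, MEAN_SECONDS)
print(f"Implied total duration: {hours:,.0f} hours")
```

The small gap between the implied and quoted totals is what one would expect from rounding both caption values to two significant figures.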