Supplementary Resources and Analysis for Automatic Speech Recognition Systems Trained on the Loquacious Dataset

Nick Rossenbach, Robin Schmitt, Tina Raissi, Simon Berger, Larissa Kleppel, Ralf Schlüter

TL;DR

This work presents supplementary resources for the Loquacious ASR dataset, including a large 216k-word vocabulary, a pronunciation lexicon with G2P-derived variants, and count-based language models. It benchmarks multiple architectures (CTC, RNN-T variants, AED, and a Factored Hybrid) at the 250-hour and 2.5k-hour data scales, exploring decoding strategies and data augmentation. The key findings are that count-based LMs and lexicon-constrained decoding substantially improve WER, and that phoneme-based decoding offers targeted advantages in some settings. The results also highlight the dataset's realistic challenges. The resources and analyses aim to facilitate fair benchmarking and broaden usability for academia and industry.

Abstract

The recently published Loquacious dataset aims to be a replacement for established English automatic speech recognition (ASR) datasets such as LibriSpeech or TED-LIUM. The main goal of the Loquacious dataset is to provide properly defined training and test partitions across many acoustic and language domains, with an open license suitable for both academia and industry. To further promote the benchmarking and usability of this new dataset, we present additional resources in the form of n-gram language models (LMs), a grapheme-to-phoneme (G2P) model, and pronunciation lexica, all openly and publicly accessible. Using these additional resources, we present experimental results across a wide range of ASR architectures with different label units and topologies. Our initial experimental results indicate that the Loquacious dataset offers a valuable case study for a variety of common challenges in ASR.

Paper Structure

This paper contains 30 sections and 10 tables.