Libriheavy: a 50,000 hours ASR corpus with punctuation casing and context

Wei Kang; Xiaoyu Yang; Zengwei Yao; Fangjun Kuang; Yifan Yang; Liyong Guo; Long Lin; Daniel Povey

TL;DR

Libriheavy addresses the shortage of large, richly annotated ASR resources by releasing a 50,000-hour corpus whose transcripts keep full punctuation, casing, and the preceding text context. It pairs Librilight audio with the original texts through a general, open-source audio-alignment pipeline and provides the labeled data in train, dev, and test splits. Baseline experiments with CTC-Attention and Transducer models show the utility of training on punctuated, cased transcripts, with improvements that depend on the amount of training data. The corpus supports contextualized ASR and downstream tasks that need text context, and the released pipeline lowers the barrier to constructing similarly formatted ASR corpora.

Abstract

In this paper, we introduce Libriheavy, a large-scale ASR corpus consisting of 50,000 hours of read English speech derived from LibriVox. To the best of our knowledge, Libriheavy is the largest freely available corpus of speech with supervisions. Unlike other open-source datasets, which provide only normalized transcriptions, Libriheavy contains richer information such as punctuation, casing, and text context, which gives more flexibility for system building. Specifically, we propose a general and efficient pipeline to locate, align, and segment the audio in the previously published Librilight corpus against its corresponding texts. Like Librilight, Libriheavy has three training subsets, small, medium, and large, of 500, 5,000, and 50,000 hours respectively. We also extract dev and test evaluation sets from the aligned audio and guarantee that their speakers and books do not overlap with those in the training sets. Baseline systems are built on the popular CTC-Attention and transducer models. Additionally, we open-source our dataset creation pipeline, which can also be used for other audio alignment tasks.
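To make the idea of recovering punctuation and casing from the source text concrete, the following is a minimal sketch, not the authors' pipeline. It assumes a normalized, possibly imperfect ASR hypothesis for one audio segment is already available, and it uses Python's standard `difflib.SequenceMatcher` to find the matching span in the original text. The function names `normalize` and `locate_in_book` and the sample sentences are hypothetical and chosen only for illustration; a real system working on whole books would need a far more scalable and robust alignment.

```python
import difflib
import re


def normalize(token: str) -> str:
    """Lowercase a token and drop everything except letters, digits and apostrophes."""
    return re.sub(r"[^a-z0-9']", "", token.lower())


def locate_in_book(hypothesis: str, book_text: str) -> str:
    """Return the span of book_text, with its original casing and punctuation,
    that matches a normalized ASR hypothesis. Toy illustration only."""
    book_tokens = book_text.split()

    # Pair each non-empty normalized token with the index of its original token.
    indexed = [(normalize(tok), i) for i, tok in enumerate(book_tokens)]
    indexed = [(norm, i) for norm, i in indexed if norm]
    book_norm = [norm for norm, _ in indexed]

    hyp_norm = [normalize(tok) for tok in hypothesis.split()]
    hyp_norm = [norm for norm in hyp_norm if norm]

    # Align the two normalized word sequences; autojunk is disabled so that
    # frequent words are not silently ignored on long inputs.
    matcher = difflib.SequenceMatcher(a=book_norm, b=hyp_norm, autojunk=False)
    blocks = [b for b in matcher.get_matching_blocks() if b.size > 0]
    if not blocks:
        return ""

    # Span from the first matched book word to the last matched book word.
    first = blocks[0].a
    last = blocks[-1].a + blocks[-1].size - 1
    start, end = indexed[first][1], indexed[last][1]
    return " ".join(book_tokens[start:end + 1])


# Hypothetical example: the hypothesis contains a recognition error ("mister").
book = ('It was a bright day. "Come in," said Mr. Brown, smiling. '
        'The door closed behind them.')
recovered = locate_in_book("come in said mister brown smiling", book)
print(recovered)  # "Come in," said Mr. Brown, smiling.
```

The recovered span keeps the quotation marks, commas, and capitalization of the source even though the hypothesis had none and contained an error. A stray match on a common word far from the true position would stretch the span, which is one reason this sketch should not be read as a production method.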
