WenetSpeech: A 10000+ Hours Multi-domain Mandarin Corpus for Speech Recognition

Binbin Zhang, Hang Lv, Pengcheng Guo, Qijie Shao, Chao Yang, Lei Xie, Xin Xu, Hui Bu, Xiaoyu Chen, Chenchen Zeng, Di Wu, Zhendong Peng

TL;DR

WenetSpeech tackles the scarcity of large-scale, diverse Mandarin ASR data by building a 22,400+ hour multi-domain Mandarin corpus through a two-track data collection pipeline (OCR on video captions for YouTube data and ASR transcription for Podcast data), followed by a novel end-to-end label error detection step that validates and filters the candidates. The dataset is organized into high-quality labeled, weakly labeled, and unlabeled portions, with confidence scores attached to the labeled data, and is accompanied by three manually labeled evaluation sets (Dev, Test_Net, and Test_Meeting) for benchmarking. The authors provide baseline results for Kaldi, ESPnet, and WeNet to demonstrate utility across major toolkits and highlight improvements with larger, more diverse training data. This resource aims to accelerate research toward production-grade Mandarin ASR with broader domain coverage and real-world robustness.

Abstract

In this paper, we present WenetSpeech, a multi-domain Mandarin corpus consisting of 10000+ hours of high-quality labeled speech, 2400+ hours of weakly labeled speech, and about 10000 hours of unlabeled speech, with 22400+ hours in total. We collect the data from YouTube and Podcast, which cover a variety of speaking styles, scenarios, domains, topics, and noisy conditions. An optical character recognition (OCR) based method is introduced to generate the audio/text segmentation candidates for the YouTube data from the corresponding video captions, while a high-quality ASR transcription system is used to generate audio/text pair candidates for the Podcast data. Then we propose a novel end-to-end label error detection approach to further validate and filter the candidates. We also provide three manually labelled high-quality test sets along with WenetSpeech for evaluation: Dev, for cross-validation purposes in training; Test_Net, collected from the Internet for matched testing; and Test_Meeting, recorded from real meetings for more challenging mismatched testing. Baseline systems trained with WenetSpeech are provided for three popular speech recognition toolkits, namely Kaldi, ESPnet, and WeNet, and recognition results on the three test sets are also provided as benchmarks. To the best of our knowledge, WenetSpeech is currently the largest open-sourced Mandarin speech corpus with transcriptions, which benefits research on production-level speech recognition.
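
The validate-and-filter step can be pictured as bucketing audio/text candidates by a per-segment confidence score. The following is a minimal sketch of that idea, not the authors' implementation: the `Candidate` type, the `bucket` and `partition` helpers, and both cutoff values are hypothetical and chosen only for illustration, since the abstract does not state how scores map to the labeled and weakly labeled portions.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List

# Placeholder cutoffs chosen for illustration; the abstract does not give values.
HIGH_CONF = 0.95
WEAK_CONF = 0.60


@dataclass
class Candidate:
    """Hypothetical audio/text pair produced by the OCR or ASR track."""
    audio_id: str
    text: str
    confidence: float  # assumed to lie in [0, 1]


def bucket(candidate: Candidate) -> str:
    """Map one candidate to a portion name based on its confidence score."""
    if candidate.confidence >= HIGH_CONF:
        return "labeled"
    if candidate.confidence >= WEAK_CONF:
        return "weak"
    return "rejected"


def partition(candidates: Iterable[Candidate]) -> Dict[str, List[Candidate]]:
    """Group candidates into labeled, weak, and rejected lists."""
    buckets: Dict[str, List[Candidate]] = {"labeled": [], "weak": [], "rejected": []}
    for c in candidates:
        buckets[bucket(c)].append(c)
    return buckets
```

Keeping the score on every segment, rather than only the bucket name, would let a downstream user apply a stricter or looser cutoff without rerunning validation.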

Paper Structure

This paper contains 19 sections, 1 equation, 4 figures, 6 tables.

Figures (4)

  • Figure 1: OCR based YouTube data collection pipeline
  • Figure 2: Example outputs of the OCR pipeline
  • Figure 3: An example forced alignment graph L of "不忘初心"
  • Figure 4: Examples of label error detection