WenetSpeech4TTS: A 12,800-hour Mandarin TTS Corpus for Large Speech Generation Model Benchmark

Linhan Ma, Dake Guo, Kun Song, Yuepeng Jiang, Shuai Wang, Liumeng Xue, Weiming Xu, Huan Zhao, Binbin Zhang, Lei Xie

TL;DR

The paper introduces WenetSpeech4TTS, a 12,800-hour Mandarin TTS corpus derived from WenetSpeech, engineered to be speaker-homogeneous and high-quality for large-scale TTS training. It details a multi-step automatic processing pipeline (merging, boundary extension, denoising, diarization, transcription, quality filtering) and partitions data into Basic, Standard, and Premium subsets with DNSMOS-based scoring. Using VALL-E and NaturalSpeech 2 as baselines, the study shows that higher-quality subsets yield better objective and subjective TTS performance, with NaturalSpeech 2 generally achieving stronger intelligibility. The corpus and its benchmarks are publicly available, enabling fair evaluation and comparison of Mandarin TTS systems at scale.

Abstract

With the development of large text-to-speech (TTS) models and the scale-up of training data, state-of-the-art TTS systems have achieved impressive performance. In this paper, we present WenetSpeech4TTS, a multi-domain Mandarin corpus derived from the open-sourced WenetSpeech dataset. To tailor it for text-to-speech tasks, we refined WenetSpeech by adjusting segment boundaries, enhancing the audio quality, and eliminating speaker mixing within each segment. Following a more accurate transcription process and a quality-based data filtering process, the resulting WenetSpeech4TTS corpus contains 12,800 hours of paired audio-text data. Furthermore, we have created subsets of varying sizes, categorized by segment quality scores, to allow for TTS model training and fine-tuning. VALL-E and NaturalSpeech 2 systems are trained and fine-tuned on these subsets to validate the usability of WenetSpeech4TTS, establishing baselines on the benchmark for fair comparison of TTS systems. The corpus and corresponding benchmarks are publicly available on Hugging Face.
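
As a rough illustration of how segments might be sorted into quality-based subsets such as Basic, Standard, and Premium, the sketch below assigns each segment to the highest tier whose DNSMOS cut-off it exceeds. The cut-off values, the "Rest" bucket, and the names `THRESHOLDS`, `assign_tier`, and `partition` are hypothetical placeholders chosen for this example; they are not stated in the text above and should not be read as the authors' implementation.

```python
from typing import Dict, List, Tuple

# Hypothetical cut-offs for illustration only; the text above does not state them.
# Tiers are listed from strictest to most lenient.
THRESHOLDS: List[Tuple[str, float]] = [
    ("Premium", 4.0),
    ("Standard", 3.8),
    ("Basic", 3.6),
]


def assign_tier(dnsmos: float) -> str:
    """Return the strictest tier whose cut-off the score exceeds, else 'Rest'."""
    for name, cutoff in THRESHOLDS:
        if dnsmos > cutoff:
            return name
    return "Rest"


def partition(scores: Dict[str, float]) -> Dict[str, List[str]]:
    """Group segment ids by tier, given a mapping of segment id to DNSMOS score."""
    tiers: Dict[str, List[str]] = {name: [] for name, _ in THRESHOLDS}
    tiers["Rest"] = []
    for seg_id, score in scores.items():
        tiers[assign_tier(score)].append(seg_id)
    return tiers
```

If the subsets are meant to be nested (every Premium segment also counting as Standard and Basic), the lists can be merged downward after partitioning; the sketch keeps them disjoint for simplicity.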

Paper Structure (15 sections, 5 figures, 2 tables)

Figures (5)

  • Figure 1: Demonstration of adjacent segments merging
  • Figure 2: Boundary extension
  • Figure 3: Distribution of DNSMOS P.808 scores for 10,000 random segments: enhanced (red) vs. original (blue).
  • Figure 4: The distribution of speech data quality. The horizontal axis represents DNSMOS P.808 scores, and the vertical axis represents the scale of data corresponding to the scores.
  • Figure 5: The distribution of segment lengths in WenetSpeech (a) and the WenetSpeech4TTS Basic subset (b).