Emilia: A Large-Scale, Extensive, Multilingual, and Diverse Dataset for Speech Generation

Haorui He, Zengqiang Shang, Chaoren Wang, Xuyuan Li, Yicheng Gu, Hua Hua, Liwei Liu, Chen Yang, Jiaqi Li, Peiyang Shi, Yuancheng Wang, Kai Chen, Pengyuan Zhang, Zhizheng Wu

TL;DR

This work addresses the shortage of spontaneous, real-world speech data for training high-fidelity speech generation models, whose training sets have been dominated by formal, read-aloud audio-books. It introduces Emilia-Pipe, an open-source preprocessing pipeline that turns multilingual in-the-wild audio into clean training samples in six steps: standardization, source separation, speaker diarization, VAD-based segmentation, ASR transcription, and filtering. Using Emilia-Pipe, the authors construct Emilia (over 101k hours across six languages) and Emilia-Large (over 216k hours). Their experiments show that models trained on in-the-wild data produce more spontaneous, human-like speech and richer speaker timbres than models trained on audio-book data, while matching their intelligibility. The authors also analyze how performance scales with dataset size and evaluate multilingual and crosslingual generation. The results point to a practical path toward more natural TTS systems and highlight open challenges in crosslingual transfer and safety. The pipeline and datasets are reusable resources for the community and could be extended to more languages and domains.
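To make the six-step structure concrete, here is a minimal sketch of how such a pipeline could be composed. It is not Emilia-Pipe's actual code or API: the `Segment` fields, the `passthrough` helper, and the filtering thresholds (`min_dur`, `max_dur`, `min_quality`) are all hypothetical placeholders chosen for illustration. The five model-backed steps are stand-ins that return their input unchanged; only the final filtering step does real work here.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Segment:
    speaker: str    # label a diarizer would assign (hypothetical field)
    start: float    # segment start, in seconds
    end: float      # segment end, in seconds
    text: str       # transcript an ASR model would produce
    quality: float  # hypothetical quality score in [0, 1]


Stage = Callable[[List[Segment]], List[Segment]]


def passthrough(name: str) -> Stage:
    """Stand-in for a model-backed step; returns its input unchanged."""
    def stage(segments: List[Segment]) -> List[Segment]:
        return list(segments)
    stage.__name__ = name
    return stage


def filter_segments(
    segments: List[Segment],
    min_dur: float = 3.0,      # assumed threshold, not from the paper
    max_dur: float = 30.0,     # assumed threshold, not from the paper
    min_quality: float = 0.5,  # assumed threshold, not from the paper
) -> List[Segment]:
    """Keep segments with a usable duration, a non-empty transcript,
    and a quality score at or above the threshold."""
    return [
        s for s in segments
        if min_dur <= s.end - s.start <= max_dur
        and s.text.strip()
        and s.quality >= min_quality
    ]


# The six steps, in the order the summary lists them.
PIPELINE: List[Stage] = [
    passthrough("standardization"),
    passthrough("source_separation"),
    passthrough("speaker_diarization"),
    passthrough("vad_segmentation"),
    passthrough("asr"),
    filter_segments,
]


def run_pipeline(segments: List[Segment],
                 stages: List[Stage] = PIPELINE) -> List[Segment]:
    """Apply each stage in order, feeding its output to the next stage."""
    for stage in stages:
        segments = stage(segments)
    return segments


demo = [
    Segment("spk0", 0.0, 5.0, "hello there", 0.9),  # passes all checks
    Segment("spk1", 5.0, 6.0, "hi", 0.9),           # too short
    Segment("spk0", 6.0, 12.0, "", 0.9),            # empty transcript
    Segment("spk1", 12.0, 20.0, "noisy", 0.2),      # quality too low
]
kept = run_pipeline(demo)
```

Representing each step as a function from a list of segments to a list of segments keeps the order explicit and lets any placeholder be swapped for a real model-backed implementation without touching `run_pipeline`.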

Abstract

Recent advancements in speech generation have been driven by large-scale training datasets. However, current models struggle to capture the spontaneity and variability inherent in real-world human speech, as they are primarily trained on audio-book datasets limited to formal, read-aloud speaking styles. To address this limitation, we introduce Emilia-Pipe, an open-source preprocessing pipeline designed to extract high-quality training data from valuable yet under-explored in-the-wild sources that capture spontaneous human speech in real-world contexts. Using Emilia-Pipe, we construct Emilia, which comprises over 101k hours of speech across six languages: English, Chinese, German, French, Japanese, and Korean. Furthermore, we expand Emilia to Emilia-Large, a dataset exceeding 216k hours, making it one of the largest open-source speech generation resources available. Extensive experiments show that Emilia-trained models produce markedly more spontaneous, human-like speech than those trained on traditional audio-book datasets, while matching their intelligibility. These models better capture diverse speaker timbres and the full spectrum of real-world conversational styles. Our work also highlights the importance of scaling dataset size for advancing speech generation performance and validates the effectiveness of Emilia for both multilingual and crosslingual speech generation tasks.

Paper Structure

This paper contains 37 sections, 5 figures, 6 tables.

Figures (5)

  • Figure 1: An overview of the Emilia-Pipe pipeline. It consists of six steps: standardization, source separation, speaker diarization, fine-grained segmentation by voice activity detection (VAD), automatic speech recognition (ASR), and filtering.
  • Figure 2: Duration statistics (in hours) of the speech data in Emilia and Emilia-Large by language. The numbers in parentheses indicate the scaling factor (multiples) of the speech data in Emilia-Large compared to the original Emilia dataset.
  • Figure 3: A comparison of acoustic and semantic diversity between the Emilia and MLS datasets.
  • Figure 4: The relationship between Emilia and Emilia-Large. Emilia-Large is an extended version of Emilia, incorporating additional data primarily processed from YODAS2.
  • Figure 5: Model performance vs. training set size on LibriSpeech-Test (L) and Emilia-Test (E).