Emilia: A Large-Scale, Extensive, Multilingual, and Diverse Dataset for Speech Generation
Haorui He, Zengqiang Shang, Chaoren Wang, Xuyuan Li, Yicheng Gu, Hua Hua, Liwei Liu, Chen Yang, Jiaqi Li, Peiyang Shi, Yuancheng Wang, Kai Chen, Pengyuan Zhang, Zhizheng Wu
TL;DR
This work addresses the shortage of spontaneous, real-world speech data for training high-fidelity speech generation models, whose training data has been dominated by formal, read-aloud audio-book recordings. It introduces Emilia-Pipe, an open-source preprocessing pipeline that standardizes audio, separates sources, diarizes speakers, segments with voice activity detection (VAD), transcribes with automatic speech recognition (ASR), and filters multilingual in-the-wild data to produce clean training samples. Using Emilia-Pipe, the authors construct Emilia (over 101k hours across six languages) and Emilia-Large (over 216k hours), and show that models trained on in-the-wild data produce more natural, human-like speech than models trained on audio-book data while matching their intelligibility; they also analyze how performance scales with dataset size and evaluate multilingual and crosslingual generation. The results show markedly more spontaneous speech and richer coverage of speaker timbres with large-scale, multilingual data, offering a practical path to more natural TTS systems and highlighting open challenges in crosslingual transfer and safety. The work gives the community a reusable dataset and pipeline, with potential extensions to more languages and domains.
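The stage order described above can be sketched as a chain of functions applied in sequence. This is a minimal illustration, not the Emilia-Pipe implementation: the `Segment` fields, the `quality` score, the `min_quality` threshold, and the function names are all hypothetical, and only the final filtering stage is made concrete.

```python
from dataclasses import dataclass
from typing import Callable, List

# Stage order as listed in the TL;DR; the identifiers are illustrative labels.
STAGE_ORDER = [
    "standardization",
    "source_separation",
    "speaker_diarization",
    "vad_segmentation",
    "asr_transcription",
    "filtering",
]

# ISO 639-1 codes for the six languages Emilia covers.
EMILIA_LANGUAGES = {"en", "zh", "de", "fr", "ja", "ko"}


@dataclass
class Segment:
    text: str       # transcript from the ASR stage (hypothetical field)
    language: str   # language code assigned to the segment (hypothetical field)
    quality: float  # hypothetical quality score in [0, 1]


def filter_segments(segments: List[Segment], min_quality: float = 0.5) -> List[Segment]:
    """Keep segments in a supported language whose score meets the threshold.

    The threshold is an assumption for illustration, not a value from the paper.
    """
    return [
        s for s in segments
        if s.language in EMILIA_LANGUAGES and s.quality >= min_quality
    ]


def run_pipeline(
    segments: List[Segment],
    stages: List[Callable[[List[Segment]], List[Segment]]],
) -> List[Segment]:
    """Apply each stage in order, feeding its output to the next stage."""
    for stage in stages:
        segments = stage(segments)
    return segments
```

For example, passing a mixed list of segments through `run_pipeline(segments, [filter_segments])` drops any segment whose language is outside the six supported ones or whose score falls below `min_quality`. The earlier stages would be added to the `stages` list in the order given by `STAGE_ORDER`.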
Abstract
Recent advancements in speech generation have been driven by large-scale training datasets. However, current models struggle to capture the spontaneity and variability inherent in real-world human speech, as they are primarily trained on audio-book datasets limited to formal, read-aloud speaking styles. To address this limitation, we introduce Emilia-Pipe, an open-source preprocessing pipeline designed to extract high-quality training data from valuable yet under-explored in-the-wild sources that capture spontaneous human speech in real-world contexts. Using Emilia-Pipe, we construct Emilia, which comprises over 101k hours of speech across six languages: English, Chinese, German, French, Japanese, and Korean. Furthermore, we expand Emilia to Emilia-Large, a dataset exceeding 216k hours, making it one of the largest open-source speech generation resources available. Extensive experiments show that Emilia-trained models produce markedly more spontaneous, human-like speech than those trained on traditional audio-book datasets, while matching their intelligibility. These models better capture diverse speaker timbres and the full spectrum of real-world conversational styles. Our work also highlights the importance of scaling dataset size for advancing speech generation performance and validates the effectiveness of Emilia for both multilingual and crosslingual speech generation tasks.
