Emilia: An Extensive, Multilingual, and Diverse Speech Dataset for Large-Scale Speech Generation

Haorui He, Zengqiang Shang, Chaoren Wang, Xuyuan Li, Yicheng Gu, Hua Hua, Liwei Liu, Chen Yang, Jiaqi Li, Peiyang Shi, Yuancheng Wang, Kai Chen, Pengyuan Zhang, Zhizheng Wu

TL;DR

Emilia tackles the scarcity of spontaneous, diverse multilingual speech data by introducing a large-scale in-the-wild dataset and an open-source preprocessing pipeline. The pipeline, Emilia-Pipe, processes raw audio in six steps (standardization, source separation, speaker diarization, segmentation, transcription, and filtering) to produce high-quality annotated data, achieving substantial throughput gains. The resulting Emilia dataset spans over 101k hours across six languages and demonstrates competitive performance in English TTS and strong multilingual transfer, outperforming many in-the-wild datasets after preprocessing. By open-sourcing both the dataset and the pipeline, the work enables scalable, community-driven advances in natural, spontaneous speech generation.
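
To make the six-step structure concrete, the following is a minimal sketch of how such a staged pipeline could be composed. It is not the Emilia-Pipe implementation: the `Clip` record, the stage function names, the 30-second maximum segment length, and the 0.5 quality threshold are all hypothetical placeholders chosen for illustration, and every stage body is a stub standing in for a real model or signal-processing step.

```python
from dataclasses import dataclass, replace
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Clip:
    """Hypothetical record for one piece of audio moving through the pipeline."""
    audio_id: str
    duration_s: float
    speaker: str = ""
    text: str = ""
    quality: float = 0.0           # assumed to be a precomputed score in [0, 1]
    history: Tuple[str, ...] = ()  # names of the stages applied so far


Stage = Callable[[List[Clip]], List[Clip]]


def _mark(clips: List[Clip], name: str) -> List[Clip]:
    """Record that a stage has been applied to every clip."""
    return [replace(c, history=c.history + (name,)) for c in clips]


def standardize(clips: List[Clip]) -> List[Clip]:
    # Stub: a real stage would convert audio to one common format.
    return _mark(clips, "standardize")


def separate_sources(clips: List[Clip]) -> List[Clip]:
    # Stub: a real stage would isolate speech from background audio.
    return _mark(clips, "separate_sources")


def diarize(clips: List[Clip]) -> List[Clip]:
    # Stub: assign one placeholder speaker label per input recording.
    labeled = [replace(c, speaker=f"spk_{c.audio_id}") for c in clips]
    return _mark(labeled, "diarize")


def segment(clips: List[Clip], max_len_s: float = 30.0) -> List[Clip]:
    # Stub: cut each clip into chunks of at most max_len_s seconds.
    # The 30-second default is an arbitrary illustrative value.
    out: List[Clip] = []
    for c in clips:
        start, idx = 0.0, 0
        while start < c.duration_s:
            length = min(max_len_s, c.duration_s - start)
            out.append(replace(c, audio_id=f"{c.audio_id}_{idx}", duration_s=length))
            start += length
            idx += 1
    return _mark(out, "segment")


def transcribe(clips: List[Clip]) -> List[Clip]:
    # Stub: a real stage would run a speech recognizer on each segment.
    texted = [replace(c, text=f"<transcript of {c.audio_id}>") for c in clips]
    return _mark(texted, "transcribe")


def filter_quality(clips: List[Clip], min_quality: float = 0.5) -> List[Clip]:
    # Stub: keep only segments whose quality score meets the threshold.
    kept = [c for c in clips if c.quality >= min_quality]
    return _mark(kept, "filter_quality")


STAGES: Tuple[Stage, ...] = (
    standardize, separate_sources, diarize, segment, transcribe, filter_quality,
)


def run_pipeline(clips: List[Clip]) -> List[Clip]:
    """Apply the six stages in order, feeding each stage's output to the next."""
    for stage in STAGES:
        clips = stage(clips)
    return clips
```

Calling `run_pipeline([Clip("a", 70.0, quality=0.9), Clip("b", 10.0, quality=0.1)])` splits the first recording into three segments and drops the second at the filtering stage. Keeping each stage as a plain function over a list of records makes it straightforward to swap a stub for a real component without touching the rest of the chain.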

Abstract

Recent advancements in speech generation models have been significantly driven by the use of large-scale training data. However, producing highly spontaneous, human-like speech remains a challenge due to the scarcity of large, diverse, and spontaneous speech datasets. In response, we introduce Emilia, the first large-scale, multilingual, and diverse speech generation dataset. Emilia starts with over 101k hours of speech across six languages, covering a wide range of speaking styles to enable more natural and spontaneous speech generation. To facilitate the scale-up of Emilia, we also present Emilia-Pipe, the first open-source preprocessing pipeline designed to efficiently transform raw, in-the-wild speech data into high-quality training data with speech annotations. Experimental results demonstrate the effectiveness of both Emilia and Emilia-Pipe. Demos are available at: https://emilia-dataset.github.io/Emilia-Demo-Page/.

Paper Structure

This paper contains 21 sections, 3 figures, and 5 tables.

Figures (3)

  • Figure 1: An overview of the Emilia-Pipe preprocessing pipeline.
  • Figure 2: Duration statistics of the speech data by language.
  • Figure 3: A comparison of acoustic and semantic diversities between Emilia and MLS datasets.