NaturalVoices: A Large-Scale, Spontaneous and Emotional Podcast Dataset for Voice Conversion

Zongyang Du, Shreeram Suresh Chandra, Ismail Rasim Ulgen, Aurosweta Mahapatra, Ali N. Salman, Carlos Busso, Berrak Sisman

TL;DR

NaturalVoices addresses the lack of large-scale spontaneous expressive data for voice conversion by providing 5,049 hours of podcast speech with automated, multi-level annotations for emotion, quality, transcripts, and sound events. The authors introduce an open-source data-sourcing pipeline and perform extensive VC and emotional VC experiments to evaluate data scaling, out-of-domain generalization, and emotion transfer. Key findings show that while larger spontaneous datasets enable robust speaker similarity and expressive transfer, current VC architectures struggle to fully exploit such data, revealing a gap between real-world speech and model capabilities. The dataset offers a valuable benchmark for robust, expressive VC and related affective speech tasks, with broad applications in TTS, anti-spoofing, enhancement, and audio understanding.

Abstract

Everyday speech conveys far more than words: it reflects who we are, how we feel, and the circumstances surrounding our interactions. Yet most existing speech datasets are acted, limited in scale, and fail to capture the expressive richness of real-life communication. With the rise of large neural networks, several large-scale speech corpora have emerged and been widely adopted across various speech processing tasks. However, the field of voice conversion (VC) still lacks large-scale, expressive, real-life speech resources suitable for modeling natural prosody and emotion. To fill this gap, we release NaturalVoices (NV), the first large-scale spontaneous podcast dataset specifically designed for emotion-aware voice conversion. It comprises 5,049 hours of spontaneous podcast recordings with automatic annotations for emotion (categorical and attribute-based), speech quality, transcripts, speaker identity, and sound events. The dataset captures expressive emotional variation across thousands of speakers, diverse topics, and natural speaking styles. We also provide an open-source pipeline with modular annotation tools and flexible filtering, enabling researchers to construct customized subsets for a wide range of VC tasks. Experiments demonstrate that NaturalVoices supports the development of robust and generalizable VC models capable of producing natural, expressive speech, while revealing limitations of current architectures when applied to large-scale spontaneous data. These results suggest that NaturalVoices is both a valuable resource and a challenging benchmark for advancing the field of voice conversion. The dataset is available at: https://huggingface.co/JHU-SmileLab
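To make the idea of annotation-based filtering concrete, the following is a minimal sketch of how a customized subset might be selected from segment-level annotations. It is not the authors' pipeline: the `Segment` record, its field names (`quality`, `emotion`, `duration_s`), the score scale, and the toy values are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set


@dataclass(frozen=True)
class Segment:
    """Hypothetical segment record; field names are assumptions, not the dataset schema."""
    segment_id: str
    speaker_id: str
    duration_s: float   # segment length in seconds
    quality: float      # assumed speech-quality score, higher is better
    emotion: str        # assumed categorical emotion label


def filter_segments(
    segments: Iterable[Segment],
    min_quality: float = 0.0,
    min_duration_s: float = 0.0,
    emotions: Optional[Set[str]] = None,
) -> List[Segment]:
    """Keep segments that satisfy every threshold; emotions=None accepts any label."""
    return [
        s for s in segments
        if s.quality >= min_quality
        and s.duration_s >= min_duration_s
        and (emotions is None or s.emotion in emotions)
    ]


def total_hours(segments: Iterable[Segment]) -> float:
    """Sum segment durations and convert seconds to hours."""
    return sum(s.duration_s for s in segments) / 3600.0


# Toy data with made-up values, only to exercise the filter.
demo = [
    Segment("seg_a", "spk_1", 1.5, 4.1, "happy"),    # too short
    Segment("seg_b", "spk_2", 6.0, 3.8, "angry"),    # passes all checks
    Segment("seg_c", "spk_1", 8.0, 2.0, "happy"),    # quality too low
    Segment("seg_d", "spk_3", 5.0, 4.5, "neutral"),  # emotion not requested
]

subset = filter_segments(
    demo, min_quality=3.0, min_duration_s=2.0, emotions={"happy", "angry"}
)
print([s.segment_id for s in subset], round(total_hours(subset), 5))
```

Combining thresholds in one predicate keeps the selection declarative, so an emotional VC subset and a high-quality neutral subset differ only in the arguments passed.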

Paper Structure

This paper contains 63 sections, 11 figures, 10 tables.

Figures (11)

  • Figure 1: An illustration of the proposed NaturalVoices Dataset with the automatic processing pipeline.
  • Figure 2: An illustration of our pipeline for processing the NaturalVoices dataset with various modules, including speaker diarization, speech recognition, speech quality evaluation, emotion attribute and category prediction, and sound event detection.
  • Figure 3: Segment-level annotation from the NaturalVoices dataset. Shown is a 2-second speech segment from document MSP-PODCAST_0001_93. Each segment entry includes speech quality metrics, emotion and speaker attributes, and additional metadata describing the acoustic and contextual characteristics of the audio.
  • Figure 4: Distributions of sampling rates, utterance durations, and speaker gender in NaturalVoices.
  • Figure 5: Lexical and sentiment analysis of NaturalVoices: (a) word cloud of frequent terms, (b) distribution of text sentiment categories.
  • ...and 6 more figures