DisfluencySpeech -- Single-Speaker Conversational Speech Dataset with Paralanguage

Kyra Wang, Dorien Herremans

TL;DR

This work addresses the lack of datasets containing semantically meaningful paralanguage by introducing DisfluencySpeech, a nearly 10-hour, single-speaker English dataset derived from Switchboard and annotated for disfluencies and non-lexical sounds. It provides three transcript variants at different levels of information removal, plus baseline benchmark models, to study how paralanguage can be synthesized from text. Objective evaluations reveal that models trained on the most complete transcript (A) can converge and reproduce some non-speech components, but performance degrades sharply when transcript information is removed (B and C), highlighting alignment challenges. The dataset, along with MFA resources, a HiFiGAN vocoder, and Wav2Vec 2.0 ASR-based CER metrics, enables open research into semantically meaningful paralanguage synthesis, with potential impact on more natural and context-aware TTS systems.
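For readers unfamiliar with the CER metric mentioned above, the following is a minimal sketch of how a character error rate is commonly computed: the Levenshtein edit distance between a reference transcript and an ASR hypothesis, divided by the reference length. It is not the authors' evaluation code. The function names `edit_distance` and `cer`, the lowercasing and whitespace normalization, and the example strings are all assumptions made for illustration, and the ASR step that would produce the hypothesis is left out.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, computed row by row."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting i reference characters to reach an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by reference length.

    Normalization (lowercasing, collapsing whitespace) is an assumption of
    this sketch, not something specified by the source.
    """
    ref = " ".join(reference.lower().split())
    hyp = " ".join(hypothesis.lower().split())
    if not ref:
        raise ValueError("reference transcript must not be empty")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # Hypothetical example: the recognized text is missing a filled pause.
    print(cer("uh i think so", "i think so"))
```

Because the denominator is the reference length, a CER computed this way can exceed 1.0 when the hypothesis contains many insertions.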

Abstract

Laughing, sighing, stuttering, and other forms of paralanguage do not contribute any direct lexical meaning to speech, but they provide crucial propositional context that aids semantic and pragmatic processes such as irony. It is thus important for artificial social agents to both understand and be able to generate speech with semantically important paralanguage. Most speech datasets do not include transcribed non-lexical speech sounds and disfluencies, while those that do are typically multi-speaker datasets where each speaker provides relatively little audio. This makes it challenging to train conversational Text-to-Speech (TTS) synthesis models that include such paralinguistic components. We thus present DisfluencySpeech, a studio-quality labeled English speech dataset with paralanguage. A single speaker recreates nearly 10 hours of expressive utterances from the Switchboard-1 Telephone Speech Corpus (Switchboard), simulating realistic informal conversations. To aid the development of a TTS model that is able to predictively synthesise paralanguage from text without such components, we provide three different transcripts at different levels of information removal (removal of non-speech events, removal of non-sentence elements, and removal of false starts), as well as benchmark TTS models trained on each of these levels.

Paper Structure (9 sections, 1 figure, 2 tables)

Figures (1)

  • Figure 1: Example of the three different ways that the same clip is transcribed in the transcripts. Blue represents a filled pause, and red a false start.