The Impact of Prosodic Segmentation on Speech Synthesis of Spontaneous Speech

Julio Cesar Galdino, Sidney Evaldo Leal, Leticia Gabriella De Souza, Rodrigo de Freitas Lima, Antonio Nelson Fornari Mendes Moreira, Arnaldo Candido Junior, Miguel Oliveira, Edresson Casanova, Sandra M. Aluísio

TL;DR

This work addresses how explicit prosodic boundary annotations influence spontaneous speech synthesis in Brazilian Portuguese by comparing manual prosodic segmentation with automatic WhisperX-based segmentation using a non-autoregressive TTS model. It leverages the NURC-SP Minimal Corpus and augments training with the CML-TTS Portuguese data to train two FastSpeech 2 pipelines, evaluating intelligibility via WER/CER and acoustic similarity through F0 contour analyses. Results show that manual prosodic segmentation provides modest gains in intelligibility and closer replication of natural pitch patterns, while automatic segmentation achieves more uniform segments but flatter prosody, indicating that current automatic prosody annotation does not fully capture expressive variability. The study underscores the value of explicit prosodic annotation for spontaneous-speech TTS and points to the need for larger corpora and improved automatic prosody segmentation to bridge the gap to naturalness, with datasets and code publicly available under CC BY-NC-ND 4.0.
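For readers unfamiliar with the intelligibility metrics named above, the following is a minimal sketch of how WER and CER are conventionally computed from a reference transcript and a hypothesis transcript. It is not the authors' evaluation code: the function names (`edit_distance`, `wer`, `cer`) are illustrative, the speech-recognition step that would produce the hypothesis is omitted, and counting spaces as characters in CER is an assumption.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences
    (minimum number of substitutions, insertions, and deletions)."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference items to reach an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance / reference length.
    Assumption: spaces are counted as characters."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


if __name__ == "__main__":
    # Hypothetical transcripts, for illustration only.
    print(wer("o gato dorme", "o gato come"))
    print(cer("casa", "caso"))
```

Lower values indicate higher intelligibility. Both rates can exceed 1.0 when the hypothesis contains many insertions.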

Abstract

Spontaneous speech presents several challenges for speech synthesis, particularly in capturing the natural flow of conversation, including turn-taking, pauses, and disfluencies. Although speech synthesis systems have made significant progress in generating natural and intelligible speech, primarily through architectures that implicitly model prosodic features such as pitch, intensity, and duration, the construction of datasets with explicit prosodic segmentation and their impact on spontaneous speech synthesis remain largely unexplored. This paper evaluates the effects of manual and automatic prosodic segmentation annotations in Brazilian Portuguese on the quality of speech synthesized by a non-autoregressive model, FastSpeech 2. Experimental results show that training with prosodic segmentation produced slightly more intelligible and acoustically natural speech. While automatic segmentation tends to create more regular segments, manual prosodic segmentation introduces greater variability, which contributes to more natural prosody. Analysis of neutral declarative utterances showed that both training approaches reproduced the expected nuclear accent pattern, but the prosodic model aligned more closely with natural pre-nuclear contours. To support reproducibility and future research, all datasets, source code, and trained models are publicly available under the CC BY-NC-ND 4.0 license.

Paper Structure

This paper contains 12 sections, 4 figures, and 1 table.

Figures (4)

  • Figure 1: Excerpt from the SP_EF_153 inquiry with five layers annotated in Praat. The first layer indicates the punctuation that ends each TB (. ? ! …); the second contains the normalized excerpt, i.e., without the annotation used for transcription in the NURC project; the next two are assigned to each speaker who appears in the inquiry (TB-L1, NTB-L1); and the last holds comments on the audio recording (com) (Santos et al., 2022).
  • Figure 2: Automatic vs. Prosodic: Averages by Token and Segment
  • Figure 3: Precision, Recall, and F1-Score by Inquiry.
  • Figure 4: Average of four F0 contour points in the original audio and trained TTS models