Speech Synthesis along Perceptual Voice Quality Dimensions

Frederik Rautenberg, Michael Kuhlmann, Fritz Seebauer, Jana Wiechmann, Petra Wagner, Reinhold Haeb-Umbach

TL;DR

This work introduces a Conditional Continuous Normalizing Flow (CNF) framework embedded in a TTS system to controllably manipulate perceptual voice qualities (PVQs) such as Breathiness, Roughness, Resonance, and Weight. By conditioning a CNF on PVQ attributes and integrating it with YourTTS, the authors enable continuous PVQ edits that apply to both seen and unseen speakers, using PVQD+ and pseudo-labeled LibriTTS-R data for training. Objective and expert evaluations show that PVQ intensities can be modulated with varying success across qualities: breathiness is more reliably controllable than roughness, and there are some trade-offs in speaker identity and gender perception. The approach holds promise as an educational tool for speech pathologists and voice actors, enabling targeted PVQ exemplars while preserving overall audio quality. Future work is aimed at broader data, user studies, and richer PVQ interdependencies.

Abstract

While expressive speech synthesis and voice conversion systems mainly focus on controlling or manipulating abstract prosodic characteristics of speech, such as emotion or accent, we address the control of perceptual voice qualities (PVQs) recognized by phonetic experts, which are speech properties at a lower level of abstraction. The ability to manipulate PVQs can be a valuable tool for teaching speech pathologists in training or voice actors. In this paper, we integrate a Conditional Continuous-Normalizing-Flow-based method into a Text-to-Speech system to modify perceptual voice attributes on a continuous scale. Unlike previous approaches, our system avoids direct manipulation of acoustic correlates and instead learns from examples. We demonstrate the system's capability by manipulating four voice qualities: roughness, breathiness, resonance, and weight. Phonetic experts evaluated these modifications for both seen and unseen speaker conditions. The results highlight both the system's strengths and areas for improvement.

Paper Structure (10 sections, 5 equations, 2 figures, 3 tables)

This paper contains 10 sections, 5 equations, 2 figures, 3 tables.

Figures (2)

  • Figure 1: YourTTS system during inference (Casanova et al., 2022). A speaker manipulation block is added to the system (dashed block). $\mathbf{a}$ are the attributes estimated from the input utterance $\mathbf{x}$, and $\tilde{\mathbf{a}}$ are the desired attributes of $\hat{\mathbf{x}}$.
  • Figure 2: Speaker embedding manipulation block, where $\mathrm{ODESolve} \left(\cdot\right)$ gives a solution for the initial value problem
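
The manipulation idea named in the Figure 2 caption can be illustrated with a minimal sketch. It assumes one plausible reading of the block: a speaker embedding is mapped to a latent by solving an initial value problem conditioned on the estimated attributes, and then mapped back conditioned on the desired attributes. Everything in the code is hypothetical and not taken from the paper: the names `vector_field`, `ode_solve`, and `manipulate`, the toy `tanh` vector field standing in for a trained network, the time convention (latent at t = 0, embedding at t = 1), and the fixed-step RK4 solver.

```python
import math


def vector_field(z, t, a):
    """Toy conditional vector field f(z, t, a).

    Assumption: a stand-in for a trained network. Each coordinate is
    pulled toward a value set by the attribute vector `a`; `t` is unused.
    """
    return [math.tanh(a[i % len(a)] - z_i) for i, z_i in enumerate(z)]


def ode_solve(z0, t0, t1, a, steps=200):
    """Solve dz/dt = f(z, t, a) from t0 to t1 with classical RK4."""
    h = (t1 - t0) / steps  # negative when integrating backward in time
    z, t = list(z0), t0
    for _ in range(steps):
        k1 = vector_field(z, t, a)
        k2 = vector_field([zi + 0.5 * h * k for zi, k in zip(z, k1)], t + 0.5 * h, a)
        k3 = vector_field([zi + 0.5 * h * k for zi, k in zip(z, k2)], t + 0.5 * h, a)
        k4 = vector_field([zi + h * k for zi, k in zip(z, k3)], t + h, a)
        z = [
            zi + h / 6.0 * (c1 + 2.0 * c2 + 2.0 * c3 + c4)
            for zi, c1, c2, c3, c4 in zip(z, k1, k2, k3, k4)
        ]
        t += h
    return z


def manipulate(s, a, a_new):
    """Edit embedding `s` by swapping the conditioning attributes."""
    # Embedding -> latent, conditioned on the estimated attributes.
    z = ode_solve(s, 1.0, 0.0, a)
    # Latent -> embedding, conditioned on the desired attributes.
    return ode_solve(z, 0.0, 1.0, a_new)


s = [0.2, -0.5, 1.0]   # toy "speaker embedding"
a = [0.3, 0.7]         # toy estimated attributes
unchanged = manipulate(s, a, a)            # same attributes: round trip
edited = manipulate(s, a, [0.9, 0.7])      # raise the first attribute
```

In this toy setup, passing the same attributes in both directions recovers the input embedding up to solver error, while changing one attribute shifts only the coordinates tied to it. A real system would use a learned vector field and likely an adaptive ODE solver.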