Disentangling Pitch and Creak for Speaker Identity Preservation in Speech Synthesis

Frederik Rautenberg; Jana Wiechmann; Petra Wagner; Reinhold Haeb-Umbach

TL;DR

We introduce a system that faithfully modifies the perceptual voice quality of creak while preserving the speaker's perceived identity; it shows greatly improved speaker verification performance over a range of creak manipulation strengths.

Abstract

We introduce a system capable of faithfully modifying the perceptual voice quality of creak while preserving the speaker's perceived identity. Although high creak probability is known to correlate with low pitch, this correlation is observed over a population of speakers and does not necessarily hold in every situation. We disentangle pitch from creak by augmenting the training dataset of a speech synthesis system that includes a speaker manipulation block based on a conditional continuous normalizing flow. Experiments show greatly improved speaker verification performance over a range of creak manipulation strengths.

Paper Structure (7 sections, 5 equations, 3 figures, 2 tables)

Introduction
System Description
Dataset adaptation
Voice quality control
Acoustic Feature Analysis
Speaker verification under creak manipulation
Conclusion

Figures (3)

Figure 1: TTS inference with a speaker embedding manipulation block, where $\mathbf{a}$ is the creak probability of $\mathbf{x}$ and $\mathbf{a} + \tilde{\mathbf{a}}$ its modified strength
Figure 2: Distribution of pitch for male (m) and female (f) speakers, where samples are grouped into creak (Crk) and non-creak (NCrk) categories, before and after adaptation
Figure 3: Equal error rate (EER) as a function of the creak manipulation factor $\beta$ for the base, adapted, and adapted-2 models
