Hierarchical Generative Modeling for Controllable Speech Synthesis

Wei-Ning Hsu, Yu Zhang, Ron J. Weiss, Heiga Zen, Yonghui Wu, Yuxuan Wang, Yuan Cao, Ye Jia, Zhifeng Chen, Jonathan Shen, Patrick Nguyen, Ruoming Pang

TL;DR

The paper tackles controlling rarely annotated speech attributes in neural TTS by introducing GMVAE-Tacotron, a hierarchical variational framework with two latent spaces: a discrete latent class for attribute groups and a continuous latent for fine-grained variation, plus an observed-attribute space for speaker-related nuances. By modeling latent attributes with a Gaussian mixture prior and integrating two variational posteriors, the approach achieves disentangled, interpretable control over style, accent, noise, and recording conditions, enabling sampling, one-shot speaker inference, and robust clean-speech synthesis from noisy data. Extensive experiments across four datasets show the model can independently manipulate speaker identity, noise level, and speaking style, while maintaining high naturalness, and demonstrate effective style transfer and cross-domain application (crowd-sourced and audiobook data). The work advances controllable TTS with principled sampling, improved interpretability, and practical implications for data augmentation and robust deployment in varied recording conditions.

Abstract

This paper proposes a neural sequence-to-sequence text-to-speech (TTS) model that can control latent attributes in the generated speech that are rarely annotated in the training data, such as speaking style, accent, background noise, and recording conditions. The model is formulated as a conditional generative model based on the variational autoencoder (VAE) framework, with two levels of hierarchical latent variables. The first level is a categorical variable, which represents attribute groups (e.g. clean/noisy) and provides interpretability. The second level, conditioned on the first, is a multivariate Gaussian variable, which characterizes specific attribute configurations (e.g. noise level, speaking rate) and enables disentangled fine-grained control over these attributes. This amounts to using a Gaussian mixture model (GMM) for the latent distribution. Extensive evaluation demonstrates its ability to control the aforementioned attributes. In particular, we train a high-quality controllable TTS model on real found data, which is capable of inferring speaker and style attributes from a noisy utterance and using them to synthesize clean speech with controllable speaking style.
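
The two-level latent structure described in the abstract can be illustrated with ancestral sampling from a Gaussian mixture: first draw a categorical component, then draw a continuous vector from that component's Gaussian. The following is a minimal sketch, not the authors' implementation. It assumes diagonal covariances, and the function name `sample_latent`, the mixture weights, means, and standard deviations are all hypothetical values chosen for illustration.

```python
import random


def sample_latent(weights, means, stddevs, rng):
    """Ancestral sampling from a diagonal-covariance Gaussian mixture (illustrative)."""
    # First level: draw the categorical component index y ~ Cat(weights).
    k = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    # Second level: draw z ~ N(means[k], diag(stddevs[k]^2)), one dimension at a time.
    z = [rng.gauss(m, s) for m, s in zip(means[k], stddevs[k])]
    return k, z


# Hypothetical 2-component, 3-dimensional mixture; the numbers are assumptions,
# not values from the paper.
WEIGHTS = [0.5, 0.5]
MEANS = [[0.0, 0.0, 0.0], [3.0, -3.0, 1.0]]
STDDEVS = [[1.0, 1.0, 1.0], [0.5, 0.5, 0.5]]

rng = random.Random(0)  # fixed seed so repeated runs give the same draw
component, latent = sample_latent(WEIGHTS, MEANS, STDDEVS, rng)
print(component, latent)
```

In this sketch the component index plays the role of the interpretable attribute group (for example clean versus noisy), while the continuous vector would carry the fine-grained configuration within that group. Conditioning synthesis on a component mean instead of a random draw corresponds to setting the standard deviations to zero.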

Paper Structure

This paper contains 47 sections, 11 equations, 16 figures, 11 tables.

Figures (16)

  • Figure 1: Graphical model representation of the proposed models. The observed class often corresponds to the speaker label. The left illustrates equation \ref{eq:gen1}, and the right illustrates the extension from Section \ref{sec:obs_attr_repr}. The grey and white nodes correspond to observed and latent variables, respectively.
  • Figure 2: Training configuration of the GMVAE-Tacotron model. Dashed lines denote sampling. The model comprises three modules: a synthesizer, a latent encoder, and an observed encoder.
  • Figure 3: Assignment distribution over ${\mathbf{y}}_{l}$ for each gender (upper) and for each accent (lower).
  • Figure 4: Left: Euclidean distance between the means of each mixture component pair. Right: Decoding the same text conditioned on the mean of a noisy (center) and a clean component (right).
  • Figure 5: SNR as a function of the value in each latent dimension, comparing clean (left) and noisy (right) components.
  • ...and 11 more figures