Towards Robust FastSpeech 2 by Modelling Residual Multimodality

Fabian Kögel, Bac Nguyen, Fabien Cardinaux

TL;DR

The paper identifies that FastSpeech 2's mean-squared-error (MSE) objective induces over-smoothing on expressive speech, because mel-spectrograms remain multimodal even after all conditioning signals (residual multimodality). It introduces TVC-GMM, a Trivariate-Chain Gaussian Mixture Model layer, to capture local time-frequency dependencies and multimodal residuals. The layer is trained with negative log-likelihood on triplet targets formed from the mel-spectrogram and two shifted copies of it. Two sampling strategies are described: naive triplet sampling and a conditional approach that reduces sampling noise. Both yield more natural spectrograms without sacrificing FastSpeech 2's efficiency. Across LJSpeech, LibriTTS, and VCTK with HiFi-GAN vocoders, TVC-GMM reduces spectrogram smoothness and improves perceptual quality, particularly for expressive data. The authors acknowledge remaining gaps tied to duration and variance prediction.
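The "triplet targets and two shifted copies" idea can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `make_triplet_targets`, the choice of the next-time and next-frequency neighbours, and cropping the border instead of padding are all assumptions made for illustration.

```python
import numpy as np


def make_triplet_targets(mel: np.ndarray) -> np.ndarray:
    """Build (T-1, F-1, 3) triplet targets from a (T, F) mel-spectrogram.

    Hypothetical helper: each triplet holds a bin, its neighbour one step
    later in time, and its neighbour one step higher in frequency.
    Border bins without both neighbours are cropped (an assumption).
    """
    center = mel[:-1, :-1]     # bin (t, f)
    next_time = mel[1:, :-1]   # copy shifted by one frame: bin (t+1, f)
    next_freq = mel[:-1, 1:]   # copy shifted by one mel bin: bin (t, f+1)
    return np.stack([center, next_time, next_freq], axis=-1)


# Toy spectrogram with 4 frames and 3 mel bins, values 0..11 row by row.
mel = np.arange(12, dtype=np.float32).reshape(4, 3)
triplets = make_triplet_targets(mel)
```

For the toy input, `triplets[0, 0]` contains `mel[0, 0]`, `mel[1, 0]` and `mel[0, 1]`. A trivariate mixture fitted to such targets sees adjacent bins jointly rather than one bin at a time.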

Abstract

State-of-the-art non-autoregressive text-to-speech (TTS) models based on FastSpeech 2 can efficiently synthesise high-fidelity and natural speech. For expressive speech datasets, however, we observe characteristic audio distortions. We demonstrate that such artefacts are introduced to the vocoder reconstruction by over-smooth mel-spectrogram predictions, which are induced by the choice of mean-squared-error (MSE) loss for training the mel-spectrogram decoder. With MSE loss, FastSpeech 2 is limited to learning conditional averages of the training distribution, which might not lie close to a natural sample if the distribution still appears multimodal after all conditioning signals. To alleviate this problem, we introduce TVC-GMM, a mixture model of Trivariate-Chain Gaussian distributions, to model the residual multimodality. TVC-GMM reduces spectrogram smoothness and improves perceptual audio quality, in particular for expressive datasets, as shown by both objective and subjective evaluation.
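The claim that an MSE-trained predictor learns a conditional average, which need not resemble any natural sample, can be checked on a toy one-dimensional example. The sketch below is not taken from the paper: the bimodal target distribution (modes at -1 and +1) and the grid search over constant predictions are assumptions chosen to make the effect visible.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical bimodal targets: two narrow modes at -1 and +1.
modes = rng.choice([-1.0, 1.0], size=10_000)
targets = modes + 0.1 * rng.standard_normal(10_000)

# Find the constant prediction that minimises MSE by grid search.
candidates = np.linspace(-1.5, 1.5, 301)
mse = ((targets[None, :] - candidates[:, None]) ** 2).mean(axis=1)
best = candidates[np.argmin(mse)]

# The optimum sits at the mean of the targets, between the two modes.
dist_to_nearest_mode = min(abs(best + 1.0), abs(best - 1.0))

# Fraction of real samples that lie within 0.5 of the MSE-optimal value.
frac_near_best = np.mean(np.abs(targets - best) < 0.5)
```

Here `best` lands near the sample mean, far from both modes, and almost no training sample lies close to it. A mixture model can instead place probability mass on each mode and sample from one of them.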

Paper Structure

This paper contains 12 sections, 2 equations, 4 figures, and 3 tables.

Figures (4)

  • Figure 1: Examples of residual multimodality observed in marginal distributions over time and frequency for phoneme $ph$ after all pitch $p$ and energy $e$ conditioning in FastSpeech 2.
  • Figure 2: Illustration of approaches. TVC-GMM (top) models dependencies of adjacent spectrogram bins by a trivariate Gaussian chain, while FastSpeech 2 (bottom) only models means.
  • Figure 3: Aligned synthesized mel-spectrogram samples for all datasets and models. FastSpeech 2 models are visibly over-smooth, while TVC-GMM models are closer to ground truth. Conditional sampling reduces the sampling noise introduced by naive sampling.
  • Figure 4: Pitch range of datasets. LibriTTS and VCTK are more diverse than LJSpeech.