Training Universal Vocoders with Feature Smoothing-Based Augmentation Methods for High-Quality TTS Systems

Jeongmin Liu, Eunwoo Song

TL;DR

The paper addresses the quality degradation that occurs when a universal vocoder is paired with diverse TTS acoustic models whose predicted features are smoother than the features the vocoder saw during training (a training-inference mismatch). It introduces a feature smoothing augmentation that randomly applies 2D linear smoothing filters to the input mel-spectrograms during vocoder training, simulating the smoothed features that acoustic models produce, without changing the architecture or requiring fine-tuning. Implemented on an enhanced UnivNet framework (eUnivNet) with harmonic-noise generation and MS-STFT/CoMB discriminators, the method improves mean opinion scores (MOS) by 11.99% with Tacotron 2 and 12.05% with FastSpeech 2, and it generalizes to unseen speakers. The approach preserves vocoder universality and reduces exposure bias, and it may also apply to other generation tasks such as singing voice and music synthesis.

Abstract

While universal vocoders have achieved proficient waveform generation across diverse voices, their integration into text-to-speech (TTS) tasks often results in degraded synthetic quality. To address this challenge, we present a novel augmentation technique for training universal vocoders. Our training scheme randomly applies linear smoothing filters to input acoustic features, facilitating vocoder generalization across a wide range of smoothings. It significantly mitigates the training-inference mismatch, enhancing the naturalness of synthetic output even when the acoustic model produces overly smoothed features. Notably, our method is applicable to any vocoder without requiring architectural modifications or dependencies on specific acoustic models. The experimental results validate the superiority of our vocoder over conventional methods, achieving 11.99% and 12.05% improvements in mean opinion scores when integrated with Tacotron 2 and FastSpeech 2 TTS acoustic models, respectively.
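The augmentation described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration only: the uniform (moving-average) kernel, the edge-replication padding, the candidate filter sizes, the probability `p`, and the function names `box_smooth_2d` and `random_smoothing` are hypothetical and are not taken from the paper. The sketch uses plain Python lists so that it runs without dependencies; a real training pipeline would more likely do this with an array library.

```python
import random


def box_smooth_2d(mel, l_t, l_f):
    """Smooth mel[freq][time] with an l_f x l_t moving-average filter.

    Assumption: a uniform kernel with odd sizes and edge replication.
    The paper only states that linear smoothing filters are used.
    """
    n_f, n_t = len(mel), len(mel[0])
    h_f, h_t = l_f // 2, l_t // 2
    out = [[0.0] * n_t for _ in range(n_f)]
    for f in range(n_f):
        for t in range(n_t):
            acc = 0.0
            for df in range(-h_f, h_f + 1):
                ff = min(max(f + df, 0), n_f - 1)  # replicate edges
                for dt in range(-h_t, h_t + 1):
                    tt = min(max(t + dt, 0), n_t - 1)
                    acc += mel[ff][tt]
            out[f][t] = acc / (l_f * l_t)
    return out


def random_smoothing(mel, rng, time_sizes=(1, 3, 5), freq_sizes=(1, 3), p=0.5):
    """With probability p, smooth the input with a randomly sized filter.

    Assumption: the size sets and p are hypothetical placeholders.
    """
    if rng.random() >= p:
        return [row[:] for row in mel]  # leave the features unsmoothed
    l_t = rng.choice(time_sizes)
    l_f = rng.choice(freq_sizes)
    return box_smooth_2d(mel, l_t, l_f)


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy 3-bin x 7-frame "mel-spectrogram" with a single sharp peak.
    mel = [[0.0] * 7 for _ in range(3)]
    mel[1][3] = 1.0
    smoothed = box_smooth_2d(mel, l_t=5, l_f=1)
    print(smoothed[1])  # the peak is spread over five frames
    augmented = random_smoothing(mel, rng)
    print(len(augmented), len(augmented[0]))
```

In this sketch the vocoder would receive the output of `random_smoothing` as its conditioning input while the training target remains the original waveform, so the model sees both sharp and smoothed features during training.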

Paper Structure

This paper contains 14 sections, 7 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Block diagram of the vocoding process in the TTS framework: (a) conventional and (b) proposed methods.
  • Figure 2: Normalized histograms of the mel-spectral distance (MSD) between the ground-truth mel-spectrograms and those generated by Tacotron 2, or simulated using different sizes of smoothing filters. These results were obtained from the test set.
  • Figure 3: Mel-spectrograms of (a) ground-truth speech, (b) generated by the Tacotron 2 acoustic model, (c) simulated using the smoothing filter ($l_t$=5, $l_f$=1), and (d) simulated using another filter ($l_t$=5, $l_f$=3). The rectangular areas highlight instances in which our method simulates smoothing similar to that produced by the acoustic model.
  • Figure 4: The UnivNet architectures: (a) the vanilla UnivNet-c32 model and (b) the proposed eUnivNet model. The notations $c$ and $k$ denote the number of channels and the kernel size of the convolution layer, respectively.