CM-TTS: Enhancing Real Time Text-to-Speech Synthesis Efficiency through Weighted Samplers and Consistency Models

Xiang Li, Fan Bu, Ambuj Mehrish, Yingting Li, Jiale Han, Bo Cheng, Soujanya Poria

TL;DR

CM-TTS reframes speech synthesis as a consistency-model problem to achieve real-time, high-quality mel-spectrogram generation without teacher distillation or adversarial training. It introduces a four-component architecture (Phoneme encoder, Variance adaptor, CM-Decoder, and HiFi-GAN vocoder) and leverages a consistency constraint with two decoders (online and target) trained via EMA, including a weighted sampler to mitigate sampling biases. The method unifies training via a CT loss and a mel reconstruction loss, while enabling single-step or few-step generation, leading to strong objective and subjective performance across multi- and zero-shot scenarios. Empirical results on VCTK, LJSpeech, and LibriSpeech show CM-TTS outperforms baselines, especially in single-step synthesis, and demonstrates robust generalization and training stability thanks to consistency training and IS-based sampling.
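
The online/target decoder pairing mentioned above can be illustrated with a generic exponential moving average (EMA) update. This is a minimal sketch, not the authors' code: the function name `ema_update`, the dictionary of scalar parameters, and the decay value are all assumptions chosen for illustration.

```python
def ema_update(target, online, decay=0.999):
    """Return new target parameters as an EMA of the online parameters.

    `target` and `online` are hypothetical dicts mapping parameter names
    to floats; a real model would hold tensors instead of scalars.
    """
    return {
        name: decay * target[name] + (1.0 - decay) * online[name]
        for name in target
    }


# Toy usage: the target drifts toward the online weights step by step.
online_params = {"w": 1.0, "b": 0.0}
target_params = {"w": 0.0, "b": 0.0}
for _ in range(3):
    target_params = ema_update(target_params, online_params, decay=0.5)
```

With a decay close to 1, the target decoder changes slowly, which is the usual motivation for using it as a stable reference during consistency training.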

Abstract

Neural Text-to-Speech (TTS) systems find broad applications in voice assistants, e-learning, and audiobook creation. The pursuit of modern models, like Diffusion Models (DMs), holds promise for achieving high-fidelity, real-time speech synthesis. Yet, the efficiency of multi-step sampling in Diffusion Models presents challenges. Efforts have been made to integrate GANs with DMs, speeding up inference by approximating denoising distributions, but this introduces issues with model convergence due to adversarial training. To overcome this, we introduce CM-TTS, a novel architecture grounded in consistency models (CMs). Drawing inspiration from continuous-time diffusion models, CM-TTS achieves top-quality speech synthesis in fewer steps without adversarial training or pre-trained model dependencies. We further design weighted samplers to incorporate different sampling positions into model training with dynamic probabilities, ensuring unbiased learning throughout the entire training process. We present a real-time mel-spectrogram generation consistency model, validated through comprehensive evaluations. Experimental results underscore CM-TTS's superiority over existing single-step speech synthesis systems, representing a significant advancement in the field.

Paper Structure

This paper contains 38 sections, 11 equations, 7 figures, and 10 tables.

Figures (7)

  • Figure 1: (a) CM-TTS architecture. (b) Decoder training scheme, where $f_{\theta}$ is parameterized to satisfy the consistency constraint discussed in Eq. \ref{property}. (c) ODE trajectory during training.
  • Figure 2: Single-step and multi-step inference using CM-TTS. For multi-step generation, denoising and noise-injection steps alternate iteratively until the desired number of steps is reached.
  • Figure 3: An Illustration of the Convergence of Loss Across DiffGAN-TTS and CM-TTS.
  • Figure 4: Convergence of loss across different Samplers.
  • Figure 5: The trend of DiffGAN-TTS and CM-TTS on the mfcc-FID metric during training on VCTK.
  • ...and 2 more figures
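
The alternation of denoising and noise injection described in the Figure 2 caption follows the general multi-step sampling recipe for consistency models. The sketch below is a hedged illustration of that generic recipe, not the CM-TTS implementation: `multistep_sample`, `toy_consistency_fn`, the noise bounds `t_max=80.0` and `eps=0.002`, and the Gaussian toy data are assumptions, and the real decoder is also conditioned on phoneme and variance features.

```python
import math
import random


def toy_consistency_fn(x, t, sigma_data=0.5, eps=0.002):
    """Toy stand-in for a trained decoder (exact for zero-mean Gaussian data).

    Maps a noisy sample at noise level t back to level eps; f(x, eps) == x.
    """
    scale = math.sqrt(sigma_data**2 + eps**2) / math.sqrt(sigma_data**2 + t**2)
    return [scale * xi for xi in x]


def multistep_sample(f, length, timesteps, t_max=80.0, eps=0.002, rng=None):
    """Generic consistency-model sampling: one step, then optional refinement."""
    rng = rng or random.Random(0)
    # Single-step generation: map pure noise at t_max straight to a sample.
    x = f([t_max * rng.gauss(0.0, 1.0) for _ in range(length)], t_max)
    # Multi-step generation: re-inject noise at level t, then denoise again.
    for t in timesteps:
        std = math.sqrt(t * t - eps * eps)
        noisy = [xi + std * rng.gauss(0.0, 1.0) for xi in x]
        x = f(noisy, t)
    return x


one_step = multistep_sample(toy_consistency_fn, 4, [])
three_step = multistep_sample(toy_consistency_fn, 4, [40.0, 10.0])
```

Passing an empty `timesteps` list gives single-step synthesis; each extra entry adds one noise-injection and denoising round, trading speed for refinement.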