StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models

Yinghao Aaron Li, Cong Han, Vinay S. Raghavan, Gavin Mischler, Nima Mesgarani

TL;DR

StyleTTS 2 introduces a diffusion-based latent style model for TTS and uses adversarial training with large speech language models (SLMs) to reach human-level naturalness on single- and multispeaker benchmarks. Differentiable duration modeling makes end-to-end waveform synthesis trainable, prosodic encoders decouple content from style, and the style diffusion operates in a latent space, which keeps sampling efficient. In MOS/CMOS listening tests the model surpasses human recordings on LJSpeech and matches them on VCTK; it also shows compelling zero-shot speaker adaptation on LibriTTS and robust generalization to out-of-distribution texts. These contributions offer a data-efficient and expressive route to high-fidelity TTS, while also highlighting evaluation and misuse considerations for real-world deployment.

Abstract

In this paper, we present StyleTTS 2, a text-to-speech (TTS) model that leverages style diffusion and adversarial training with large speech language models (SLMs) to achieve human-level TTS synthesis. StyleTTS 2 differs from its predecessor by modeling styles as a latent random variable through diffusion models to generate the most suitable style for the text without requiring reference speech, achieving efficient latent diffusion while benefiting from the diverse speech synthesis offered by diffusion models. Furthermore, we employ large pre-trained SLMs, such as WavLM, as discriminators with our novel differentiable duration modeling for end-to-end training, resulting in improved speech naturalness. StyleTTS 2 surpasses human recordings on the single-speaker LJSpeech dataset and matches them on the multispeaker VCTK dataset as judged by native English speakers. Moreover, when trained on the LibriTTS dataset, our model outperforms previous publicly available models for zero-shot speaker adaptation. This work achieves the first human-level TTS on both single- and multispeaker datasets, showcasing the potential of style diffusion and adversarial training with large SLMs. The audio demos and source code are available at https://styletts2.github.io/.

Paper Structure

This paper contains 40 sections, 48 equations, 7 figures, 9 tables, 1 algorithm.

Figures (7)

  • Figure 1: Training and inference scheme of StyleTTS 2 for the single-speaker case. For the multi-speaker case, the acoustic and prosodic style encoders (denoted as $\bm{E}$) first take reference audio $\bm{x}_\text{ref}$ of the target speaker and produce a reference style vector $\bm{c} = \bm{E}(\bm{x}_\text{ref})$. The style diffusion model then takes $\bm{c}$ as a reference to sample $\bm{s}_p$ and $\bm{s}_a$ that correspond to the speaker in $\bm{x}_\text{ref}$.
  • Figure 2: t-SNE visualization of style vectors sampled via style diffusion from texts in five emotions, showing that emotions are properly separated for seen and unseen speakers. (a) Clusters of emotion from styles sampled by the LJSpeech model. (b) Distinct clusters of styles sampled from 5 unseen speakers by the LibriTTS model. (c) Loose clusters of emotions from Speaker 1 in (b).
  • Figure 3: Histograms and kernel density estimation of the mean F0 and energy values of speech, synthesized with texts in five different emotions. The blue color ("Ground Truth") denotes the distributions of the ground truth samples in the test set. StyleTTS 2 shows distinct distributions for different emotions and produces samples that cover the entire range of the ground truth distributions.
  • Figure 4: Illustration of our proposed differentiable duration upsampler. (a) Probability output from the duration predictor for 5 input tokens with $L = 5$. (b) Gaussian filter $\mathcal{N}_{\ell_{i-1}}$ centered at $\ell_{i-1}$. (c) Unnormalized predicted alignment $\tilde{f}_{a_i}[n]$ from the convolution operation between (a) and (b). (d) Normalized predicted alignment $\bm{a}_{\text{pred}}$ over the phoneme axis.
  • Figure 5: An example of duration predictor output and the predicted alignment with and without the differentiable duration upsampler. (a) displays the log probability from the duration predictor for improved visualization. Although (b) and (c) differ, the duration predictor is trained end-to-end with the SLM discriminator, making the difference perceptually indistinguishable in synthesized speech.
  • ...and 2 more figures
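
The four panels of Figure 4 can be read as a small pipeline. The following is a minimal pure-Python sketch of that reading, not the authors' implementation. It assumes that `q[i][k]` is the predictor's probability that token `i` lasts at least `k + 1` frames, that a token's expected duration is the sum of its row, that the Gaussian width `sigma` is a free parameter, and that normalization divides by the per-frame sum over tokens. The caption states none of these details, and the name `soft_alignment` is hypothetical.

```python
import math


def gaussian(x, center, sigma):
    """Unnormalized Gaussian bump, panel (b) of Figure 4."""
    return math.exp(-0.5 * ((x - center) / sigma) ** 2)


def soft_alignment(q, num_frames, sigma=1.5):
    """Return a token-by-frame soft alignment from duration probabilities.

    q[i][k] is ASSUMED to be the probability that token i lasts at least
    k + 1 frames; the figure caption does not define it this precisely.
    """
    num_tokens = len(q)

    # Expected duration of each token (assumption: sum of its row).
    durations = [sum(row) for row in q]

    # Start of token i = end of token i-1 (the centre l_{i-1} in the caption).
    starts = [0.0]
    for d in durations[:-1]:
        starts.append(starts[-1] + d)

    # Panel (c): convolve each probability row with the Gaussian centred
    # at the token's start position, (q_i * N)[n] = sum_k q_i[k] N(n - k).
    unnorm = []
    for i in range(num_tokens):
        row = []
        for n in range(num_frames):
            total = 0.0
            for k, p in enumerate(q[i]):
                total += p * gaussian(n - k, starts[i], sigma)
            row.append(total)
        unnorm.append(row)

    # Panel (d): normalize over the token (phoneme) axis for every frame.
    # Plain sum-normalization is an assumption made for this sketch.
    aligned = [[0.0] * num_frames for _ in range(num_tokens)]
    for n in range(num_frames):
        col_sum = sum(unnorm[i][n] for i in range(num_tokens))
        for i in range(num_tokens):
            aligned[i][n] = unnorm[i][n] / col_sum if col_sum > 0 else 0.0
    return aligned


# Toy input: 3 tokens, L = 5, hard durations of 2, 3 and 1 frames.
q = [
    [1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0],
    [1, 0, 0, 0, 0],
]
alignment = soft_alignment(q, num_frames=6, sigma=0.5)
best_token_per_frame = [
    max(range(len(q)), key=lambda i: alignment[i][n]) for n in range(6)
]
print(best_token_per_frame)  # [0, 0, 1, 1, 1, 2]
```

With hard 0/1 probabilities and a narrow Gaussian, the most probable token per frame follows the monotonic pattern implied by the durations. Every step is a smooth function of `q`, which is the property that lets gradients from a downstream loss reach the duration predictor.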