StyleTTS-ZS: Efficient High-Quality Zero-Shot Text-to-Speech Synthesis with Distilled Time-Varying Style Diffusion

Yinghao Aaron Li, Xilin Jiang, Cong Han, Nima Mesgarani

TL;DR

StyleTTS-ZS tackles zero-shot text-to-speech by learning a fixed-length, time-varying style representation via diffusion conditioned on text prompts and a speaker prompt. It introduces a four-part architecture (acoustic synthesizer, vector-quantized prosody autoencoder, distilled time-varying style diffusion, and multimodal discriminators) and uses classifier-free guidance to improve similarity, while distilling the diffusion model to a one-shot sampler with only 10k samples. The method achieves 10–20x faster sampling with negligible loss in quality and surpasses previous large-scale zero-shot TTS models in naturalness and similarity on LibriTTS/LibriLight benchmarks. It enables practical real-time zero-shot TTS and broad applications, while acknowledging ethical considerations around misuse and deepfake risks.

Abstract

The rapid development of large-scale text-to-speech (TTS) models has led to significant advancements in modeling diverse speaker prosody and voices. However, these models often face issues such as slow inference speeds, reliance on complex pre-trained neural codec representations, and difficulties in achieving naturalness and high similarity to reference speakers. To address these challenges, this work introduces StyleTTS-ZS, an efficient zero-shot TTS model that leverages distilled time-varying style diffusion to capture diverse speaker identities and prosodies. We propose a novel approach that represents human speech using input text and fixed-length time-varying discrete style codes to capture diverse prosodic variations, trained adversarially with multi-modal discriminators. A diffusion model is then built to sample this time-varying style code for efficient latent diffusion. Using classifier-free guidance, StyleTTS-ZS achieves high similarity to the reference speaker in the style diffusion process. Furthermore, to expedite sampling, the style diffusion model is distilled with perceptual loss using only 10k samples, maintaining speech quality and similarity while reducing inference time by 90%. Our model surpasses previous state-of-the-art large-scale zero-shot TTS models in both naturalness and similarity while sampling 10-20 times faster, making it an attractive alternative for efficient large-scale zero-shot TTS systems. The audio demo, code, and models are available at https://styletts-zs.github.io/.
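
The classifier-free guidance mentioned in the abstract can be illustrated with a minimal sketch. This is the generic formulation of the technique, not the authors' implementation: the function name `cfg_combine`, the guidance-scale convention (a scale of 1 returns the conditional prediction, 0 the unconditional one), and the toy array shapes are all assumptions made for illustration.

```python
import numpy as np


def cfg_combine(pred_cond, pred_uncond, guidance_scale):
    """Generic classifier-free guidance (hypothetical helper, not from the paper).

    Moves the denoiser output from the unconditional prediction toward the
    prompt-conditioned one; scales above 1 extrapolate past the conditional
    prediction, which typically strengthens adherence to the condition.
    """
    return pred_uncond + guidance_scale * (pred_cond - pred_uncond)


# Toy stand-ins for two denoiser outputs over a fixed-length style sequence.
# The shape (50, 8) is illustrative only: 50 style positions, 8 channels.
rng = np.random.default_rng(0)
pred_uncond = rng.standard_normal((50, 8))  # prediction without the speaker prompt
pred_cond = rng.standard_normal((50, 8))    # prediction with the speaker prompt

guided = cfg_combine(pred_cond, pred_uncond, guidance_scale=2.0)
```

Under this convention, a larger guidance scale would be expected to trade sample diversity for closer similarity to the reference speaker; the scale the paper actually uses is not stated in this section.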

Paper Structure

This paper contains 27 sections, 24 equations, 3 figures, 11 tables.

Figures (3)

  • Figure 1: Overview of the StyleTTS-ZS architecture. During training, the model uses ground-truth speech to extract prosodic features and encodes text and style with prompt speech. The prosody encoder compresses these features into a fixed-length time-varying style vector, which is regularized and decoded back by the prosody decoder. The style diffusion denoiser uses this vector for diffusion model training, and the decoder reconstructs speech using prosodic features, text embeddings, and global style, with multimodal discriminators assessing the output. Bold indicates system inputs: speech prompts and phonemes are used by both the style diffusion model and the acoustic synthesizer.
  • Figure 2: Architectures of the newly proposed components in StyleTTS-ZS. In (b) and (d), the dark part of the output is discarded, and only the grey part is used.
  • Figure 3: Effects of bottleneck complexity on the performance of the prosody autoencoder and the diffusion denoiser. (a) Fixed time-varying style length $K = 50$ and varying codebook size. (b) Fixed codebook size of 1024 and varying style length $K$.