MegaTTS 3: Sparse Alignment Enhanced Latent Diffusion Transformer for Zero-Shot Speech Synthesis

Ziyue Jiang, Yi Ren, Ruiqi Li, Shengpeng Ji, Boyang Zhang, Zhenhui Ye, Chen Zhang, Bai Jionghao, Xiaoda Yang, Jialong Zuo, Yu Zhang, Rui Liu, Xiang Yin, Zhou Zhao

TL;DR

This work tackles robustness and naturalness in zero-shot TTS by introducing MegaTTS 3, a latent diffusion transformer guided by sparse alignment. It combines coarse alignment anchors with a diffusion-based generator, accelerates sampling with piecewise rectified flow (PeRFlow), and uses multi-condition classifier-free guidance (CFG) to control accent and speaker timbre separately. The approach achieves state-of-the-art zero-shot speech quality, strong accent controllability, and efficient 8-step generation, while improving prosodic naturalness and robustness to duration errors. The results demonstrate practical impact for high-quality, controllable TTS in multilingual, expressive applications, with attention to scalability and reproducibility.

Abstract

While recent zero-shot text-to-speech (TTS) models have significantly improved speech quality and expressiveness, mainstream systems still suffer from issues related to speech-text alignment modeling: 1) models without explicit speech-text alignment modeling exhibit less robustness, especially for hard sentences in practical applications; 2) predefined alignment-based models suffer from naturalness constraints of forced alignments. This paper introduces MegaTTS 3, a TTS system featuring an innovative sparse alignment algorithm that guides the latent diffusion transformer (DiT). Specifically, we provide sparse alignment boundaries to MegaTTS 3 to reduce the difficulty of alignment without limiting the search space, thereby achieving high naturalness. Moreover, we employ a multi-condition classifier-free guidance strategy for accent intensity adjustment and adopt the piecewise rectified flow technique to accelerate the generation process. Experiments demonstrate that MegaTTS 3 achieves state-of-the-art zero-shot TTS speech quality and supports highly flexible control over accent intensity. Notably, our system can generate high-quality one-minute speech with only 8 sampling steps. Audio samples are available at https://sditdemo.github.io/sditdemo/.
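The abstract names a multi-condition classifier-free guidance strategy but does not state its formula here. As a rough illustration only, the sketch below uses a common nested two-condition form: start from the unconditional prediction, add a scaled step toward the text-conditioned prediction, then add a scaled step from text-only toward text-plus-speaker. The function name, the argument names, and the exact combination rule are assumptions for illustration, not the paper's confirmed method.

```python
from typing import List, Sequence


def multi_condition_cfg(
    uncond: Sequence[float],
    text_only: Sequence[float],
    text_and_speaker: Sequence[float],
    text_scale: float,
    speaker_scale: float,
) -> List[float]:
    """Combine three model predictions with two guidance scales.

    Assumed (hypothetical) rule, applied element-wise:
        out = uncond
              + text_scale    * (text_only - uncond)
              + speaker_scale * (text_and_speaker - text_only)
    """
    if not (len(uncond) == len(text_only) == len(text_and_speaker)):
        raise ValueError("all predictions must have the same length")
    return [
        u + text_scale * (t - u) + speaker_scale * (ts - t)
        for u, t, ts in zip(uncond, text_only, text_and_speaker)
    ]


# Toy predictions for a two-dimensional latent (made-up numbers).
uncond = [0.0, 1.0]
text_only = [1.0, 3.0]
text_and_speaker = [2.0, 2.0]

# With both scales at 1 the result equals the fully conditioned prediction.
print(multi_condition_cfg(uncond, text_only, text_and_speaker, 1.0, 1.0))
# Raising text_scale while lowering speaker_scale weights the text condition
# more heavily than the speaker prompt.
print(multi_condition_cfg(uncond, text_only, text_and_speaker, 2.0, 0.5))
```

Under this assumed form, the two scales act as independent knobs: one controls how strongly the output follows the text condition, the other how strongly it follows the speaker prompt, which is one plausible way to trade off accent intensity against timbre similarity.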

Paper Structure

This paper contains 50 sections, 5 equations, 12 figures, 11 tables.

Figures (12)

  • Figure 1: (a) The WaveVAE model; (b) Overview of our model. We insert the sparse alignment anchors into the latent vector sequence to provide coarse alignment information. The transformer blocks in MegaTTS 3 will automatically build fine-grained alignment paths.
  • Figure 2: The confusion matrices between the perceived and intended accent categories of synthesized speech. The X-axis and Y-axis represent the intended and perceived categories, respectively.
  • Figure 4: Sentence-level duration control.
  • Figure 5: Phoneme-level duration control.
  • Figure : (a) Screenshot of CMOS testing.
  • ...and 7 more figures
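
The Figure 1 caption says sparse alignment anchors are inserted into the latent sequence to give coarse alignment information, leaving the transformer to build the fine-grained paths. A minimal sketch of one way such anchors could be built is shown below. It assumes frame-level phoneme durations are already available (for example from a forced aligner), keeps a single anchor per phoneme at a random frame inside its span, and fills every other frame with a placeholder. The one-anchor-per-phoneme rule, the `<mask>` token, and the function name are illustrative assumptions rather than details confirmed by this page.

```python
import random
from typing import List, Optional, Sequence

MASK = "<mask>"  # hypothetical placeholder for frames without an anchor


def sparse_anchors(
    phonemes: Sequence[str],
    durations: Sequence[int],
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Return a frame-length sequence with one anchor per phoneme.

    Each phoneme occupies `durations[i]` frames. Exactly one frame in that
    span keeps the phoneme token (the anchor); the rest become MASK, so the
    model sees roughly where a phoneme is but not its exact boundaries.
    """
    if len(phonemes) != len(durations):
        raise ValueError("phonemes and durations must have the same length")
    rng = rng or random.Random()
    frames: List[str] = []
    for phoneme, duration in zip(phonemes, durations):
        if duration <= 0:
            raise ValueError("each duration must be a positive frame count")
        span = [MASK] * duration
        span[rng.randrange(duration)] = phoneme  # pick the anchor position
        frames.extend(span)
    return frames


# Toy example: four phonemes spanning ten frames in total.
anchors = sparse_anchors(["HH", "AH", "L", "OW"], [3, 2, 4, 1], random.Random(0))
print(anchors)
```

Because only one frame per phoneme is fixed, the order and approximate location of each phoneme are constrained while the exact boundaries stay free, which matches the caption's description of coarse rather than forced alignment.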