M3-TTS: Multi-modal DiT Alignment & Mel-latent for Zero-shot High-fidelity Speech Synthesis

Xiaopeng Wang, Chunyu Qiang, Ruibo Fu, Zhengqi Wen, Xuefei Liu, Yukun Liu, Yuzhe Liang, Kang Yin, Yuankun Xie, Heng Xie, Chenxing Li, Chen Zhang, Changsheng Li

TL;DR

M3-TTS tackles the longstanding challenge of unreliable alignment in non-autoregressive TTS by introducing a joint Multi-Modal Diffusion Transformer and a Mel-VAE latent target that enables monotonic, padding-free text–speech alignment. The two-stage diffusion framework, with Joint-DiT for cross-modal alignment and Single-DiT for refinement, coupled with an ODE-based generation process, yields high-fidelity 44.1 kHz speech in a zero-shot setting. Empirical results on Seed-TTS and AISHELL-3 demonstrate state-of-the-art NAR performance, with favorable trade-offs between intelligibility, naturalness, and efficiency when using the Mel-VAE latent target. The approach reduces memory and computation while maintaining competitive quality, highlighting the practicality of diffusion-based cross-modal TTS for real-time, high-fidelity synthesis.

Abstract

Non-autoregressive (NAR) text-to-speech synthesis relies on length alignment between text sequences and audio representations, constraining naturalness and expressiveness. Existing methods depend on duration modeling or pseudo-alignment strategies that severely limit naturalness and computational efficiency. We propose M3-TTS, a concise and efficient NAR TTS paradigm based on a multi-modal diffusion transformer (MM-DiT) architecture. M3-TTS employs joint diffusion transformer layers for cross-modal alignment, achieving stable monotonic alignment between variable-length text-speech sequences without pseudo-alignment requirements. Single diffusion transformer layers further enhance acoustic detail modeling. The framework integrates a Mel-VAE codec that provides 3× training acceleration. Experimental results on the Seed-TTS and AISHELL-3 benchmarks demonstrate that M3-TTS achieves state-of-the-art NAR performance with the lowest word error rates (1.36% English, 1.31% Chinese) while maintaining competitive naturalness scores. Code and demos will be available at https://wwwwxp.github.io/M3-TTS.

Paper Structure

This paper contains 13 sections, 4 equations, 2 figures, 2 tables.

Figures (2)

  • Figure 1: Overview of M3-TTS: training (left) and inference (right). Text is encoded into $T$, and the reference speech is encoded by a Mel-VAE into latents $x_1$. Noise $x_0$ and $x_1$ are linearly interpolated to form $x_t$, and conditioning is applied with a global token $c_g$ and a frame-level token $c_f=c_g+(1-m)\odot x_1$. Joint-DiT aligns $[x_t;T]$ in a unified attention space and splits the output into $(H^{a},H^{t})$. Single-DiT refines only the speech branch $H^{a}$. During inference, the output length is estimated using the reference speech-to-text ratio; an ODE solver integrates from noise to a latent; the Mel decoder then decodes it into a spectrogram.
  • Figure 2: Joint-DiT cross-modal attention visualization for four test samples. Each heatmap shows attention from speech tokens (rows) to text tokens (columns); red dots indicate the row-wise argmax.
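
The quantities named in the two captions can be illustrated with a small toy sketch. It is not the authors' implementation. Several things in it are assumptions that the captions do not state: the interpolation convention $x_t=(1-t)x_0+t\,x_1$ with $t=0$ at the noise end, a fixed-step Euler solver standing in for the unspecified ODE solver, scalar per-frame values in place of latent vectors, the reading of $m=1$ as "frame to be generated", and all function names, which are hypothetical.

```python
from typing import Callable, List

Vector = List[float]


def interpolate(x0: Vector, x1: Vector, t: float) -> Vector:
    """Linear interpolation between noise x0 and latent x1.

    Assumed convention: x_t = (1 - t) * x0 + t * x1, so t=0 is pure noise.
    """
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def frame_condition(c_g: float, mask: Vector, x1: Vector) -> Vector:
    """Frame-level token c_f = c_g + (1 - m) * x1, applied element-wise.

    Assumption: m=1 marks frames to be generated, so x1 is hidden there.
    """
    return [c_g + (1.0 - m) * v for m, v in zip(mask, x1)]


def estimate_length(ref_frames: int, ref_text_len: int, target_text_len: int) -> int:
    """Scale the target text length by the reference speech-to-text ratio."""
    return max(1, round(target_text_len * ref_frames / ref_text_len))


def euler_integrate(
    velocity: Callable[[Vector, float], Vector], x0: Vector, steps: int
) -> Vector:
    """Integrate dx/dt = velocity(x, t) from t=0 (noise) to t=1 (latent).

    Fixed-step Euler is an assumption; the source only says "ODE solver".
    """
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        v = velocity(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def rowwise_argmax(attn: List[Vector]) -> List[int]:
    """For each speech token (row), index of the most-attended text token."""
    return [max(range(len(row)), key=row.__getitem__) for row in attn]


def is_monotonic(indices: List[int]) -> bool:
    """True if the argmax path never moves backwards through the text."""
    return all(a <= b for a, b in zip(indices, indices[1:]))
```

As a usage note: with a constant velocity field equal to `x1 - x0`, `euler_integrate` moves a sample from `x0` to `x1`, which is the straight path implied by the linear interpolation. Applying `rowwise_argmax` followed by `is_monotonic` to an attention matrix is one simple way to check the kind of monotonic alignment that Figure 2 visualizes with red dots.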