MELA-TTS: Joint transformer-diffusion model with representation alignment for speech synthesis

Keyu An, Zhiyu Zhang, Changfeng Gao, Yabin Li, Zhendong Peng, Haoxu Wang, Zhihao Du, Han Zhao, Zhifu Gao, Xiangang Li

TL;DR

MELA-TTS presents a unified transformer-diffusion framework for end-to-end TTS that omits speech tokenization and multi-stage pipelines by generating continuous mel-spectrogram frames. It introduces a representation alignment module that aligns Transformer decoder outputs with semantic ASR embeddings, improving content consistency and accelerating training. Through extensive experiments on LibriTTS and a large 170k-hour multilingual dataset, the approach achieves state-of-the-art or competitive WER/CER and solid zero-shot voice cloning, in both offline and streaming modes, and demonstrates strong scaling behavior. The work highlights the viability and advantages of continuous-feature generation in TTS, offering a compelling alternative to discrete-token-based methods and suggesting future applications in broader audio synthesis tasks.

Abstract

This work introduces MELA-TTS, a novel joint transformer-diffusion framework for end-to-end text-to-speech synthesis. By autoregressively generating continuous mel-spectrogram frames from linguistic and speaker conditions, our architecture eliminates the need for speech tokenization and multi-stage processing pipelines. To address the inherent difficulties of modeling continuous features, we propose a representation alignment module that aligns output representations of the transformer decoder with semantic embeddings from a pretrained ASR encoder during training. This mechanism not only speeds up training convergence, but also enhances cross-modal coherence between the textual and acoustic domains. Comprehensive experiments demonstrate that MELA-TTS achieves state-of-the-art performance across multiple evaluation metrics while maintaining robust zero-shot voice cloning capabilities, in both offline and streaming synthesis modes. Our results establish a new benchmark for continuous feature generation approaches in TTS, offering a compelling alternative to discrete-token-based paradigms.

Paper Structure

This paper contains 14 sections, 5 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: The joint transformer and diffusion architecture. The autoregressive transformer decoder generates continuous vectors $\mathbf{h}$, which condition the diffusion model that generates the mel chunks.
  • Figure 2: Left: the diffusion module uses $\mathbf{h}$, together with speaker embeddings $\mathbf{v}$ and utterance embeddings $\mathbf{u}$, as conditional inputs for mel-spectrogram denoising. $\mathbf{h}$, $\mathbf{v}$, and $\mathbf{u}$ are each upsampled to match the chunk size of the mel-spectrogram. Right: the representation alignment module. $\mathbf{h}$ is also upsampled to match the length of the pretrained semantic representation.
  • Figure 3: A diagram of the autoregressive language model for streaming synthesis in MELA-TTS.
  • Figure 4: Comparison of WER over training epochs with and without representation alignment.
  • Figure 5: Subjective preference between MELA-TTS and CosyVoice.
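
The representation alignment step described in the abstract and in Figure 2 (right) can be illustrated with a small sketch. This is not the authors' implementation: the summary above does not state the loss function, so the cosine-similarity objective, the nearest-neighbour upsampling, and the names `upsample_repeat`, `cosine`, and `alignment_loss` are all assumptions made for illustration. The sketch also assumes the decoder outputs and the ASR embeddings already share the same dimension, whereas a real system would likely need a learned projection.

```python
import math


def upsample_repeat(frames, target_len):
    """Stretch a sequence of vectors to target_len steps by repeating frames.

    Assumption: simple nearest-neighbour upsampling; the paper may use a
    different (possibly learned) upsampling scheme.
    """
    n = len(frames)
    return [frames[min(i * n // target_len, n - 1)] for i in range(target_len)]


def cosine(a, b, eps=1e-8):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + eps)


def alignment_loss(h, semantic):
    """Hypothetical alignment loss: 1 minus the mean frame-wise cosine similarity.

    h        : decoder outputs, one vector per autoregressive step
    semantic : embeddings from a pretrained ASR encoder (treated as fixed targets)
    """
    h_up = upsample_repeat(h, len(semantic))
    sims = [cosine(a, b) for a, b in zip(h_up, semantic)]
    return 1.0 - sum(sims) / len(sims)


# Two decoder steps stretched over four semantic frames.
h = [[1.0, 0.0], [0.0, 1.0]]
semantic = [[2.0, 0.0], [3.0, 0.0], [0.0, 1.0], [0.0, 5.0]]
loss = alignment_loss(h, semantic)  # close to 0: directions match frame by frame
```

In a training loop, a term of this kind would presumably be added to the diffusion loss with a weighting factor, and it would be dropped at inference time, since the abstract states that the alignment is applied during training.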