MELA-TTS: Joint transformer-diffusion model with representation alignment for speech synthesis

Keyu An; Zhiyu Zhang; Changfeng Gao; Yabin Li; Zhendong Peng; Haoxu Wang; Zhihao Du; Han Zhao; Zhifu Gao; Xiangang Li

TL;DR

MELA-TTS is a unified transformer-diffusion framework for end-to-end TTS that generates continuous mel-spectrogram frames directly, avoiding speech tokenization and multi-stage pipelines. It introduces a representation alignment module that aligns Transformer decoder outputs with semantic embeddings from a pretrained ASR encoder, improving content consistency and accelerating training. In experiments on LibriTTS and a 170k-hour multilingual dataset, the approach achieves state-of-the-art or competitive WER/CER and robust zero-shot voice cloning in both offline and streaming modes, and it shows strong scaling behavior. The work argues that continuous-feature generation is a viable alternative to discrete-token-based TTS and suggests future applications in broader audio synthesis tasks.

Abstract

This work introduces MELA-TTS, a novel joint transformer-diffusion framework for end-to-end text-to-speech synthesis. By autoregressively generating continuous mel-spectrogram frames from linguistic and speaker conditions, our architecture eliminates the need for speech tokenization and multi-stage processing pipelines. To address the inherent difficulties of modeling continuous features, we propose a representation alignment module that aligns output representations of the transformer decoder with semantic embeddings from a pretrained ASR encoder during training. This mechanism not only speeds up training convergence, but also enhances cross-modal coherence between the textual and acoustic domains. Comprehensive experiments demonstrate that MELA-TTS achieves state-of-the-art performance across multiple evaluation metrics while maintaining robust zero-shot voice cloning capabilities, in both offline and streaming synthesis modes. Our results establish a new benchmark for continuous feature generation approaches in TTS, offering a compelling alternative to discrete-token-based paradigms.
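To make the representation alignment idea concrete, the following is a minimal sketch of one way such an auxiliary training loss could look. It is not the authors' implementation: the abstract does not state the loss function, so the cosine-distance form, the function names (`alignment_loss`, `total_loss`), the weight `align_weight`, and the premise that decoder states and ASR embeddings already share the same frame rate and dimensionality are all assumptions made for illustration. A real system would typically need a learned projection and length matching before the comparison.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (lists of floats)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as uninformative
    return dot / (norm_u * norm_v)


def alignment_loss(decoder_states, asr_embeddings):
    """Mean (1 - cosine similarity) over time-aligned frames.

    ASSUMPTION: both inputs are lists of per-frame vectors with the same
    number of frames and the same dimensionality. The cosine form is an
    illustrative choice, not something stated in the source.
    """
    if not decoder_states or len(decoder_states) != len(asr_embeddings):
        raise ValueError("inputs must be non-empty and have equal frame counts")
    total = sum(
        1.0 - cosine_similarity(h, e)
        for h, e in zip(decoder_states, asr_embeddings)
    )
    return total / len(decoder_states)


def total_loss(generation_loss, decoder_states, asr_embeddings, align_weight=0.5):
    """Combine the main generation loss with the auxiliary alignment term.

    ASSUMPTION: `align_weight` is a hypothetical hyperparameter; the ASR
    embeddings are treated as fixed targets and are only used in training.
    """
    return generation_loss + align_weight * alignment_loss(
        decoder_states, asr_embeddings
    )


if __name__ == "__main__":
    # Toy example: two frames of 2-dimensional features.
    decoder_states = [[1.0, 0.0], [0.5, 0.5]]
    asr_embeddings = [[2.0, 0.0], [1.0, 1.0]]
    print(alignment_loss(decoder_states, asr_embeddings))
    print(total_loss(1.0, decoder_states, asr_embeddings))
```

Because the alignment term only adds a training-time target for the decoder's hidden states, a design along these lines leaves inference unchanged: the ASR encoder would not be needed when synthesizing speech.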
