LLaDA-TTS: Unifying Speech Synthesis and Zero-Shot Editing via Masked Diffusion Modeling

Xiaoyu Fan, Huizhi Xie, Wei Zou, Yunzhang Chen

Abstract

Large language model (LLM)-based text-to-speech (TTS) systems achieve remarkable naturalness via autoregressive (AR) decoding, but require N sequential steps to generate N speech tokens. We present LLaDA-TTS, which replaces the AR LLM with a masked diffusion model that completes generation in a fixed number of parallel steps, decoupling inference latency from sequence length. Remarkably, using only 50 hours of fine-tuning data, we successfully transfer a pretrained AR checkpoint to the masked diffusion paradigm via bidirectional attention. At 64 steps, LLaDA-TTS achieves 0.98% CER (zh) and 1.96% WER (en) on Seed-TTS-Eval, matching the original CosyVoice 3 baseline performance while delivering a 2x LLM-stage speedup--a notable acceleration achieved despite the absence of KV cache, an optimization the AR baseline heavily relies on. Beyond acceleration, the bidirectional architecture naturally enables zero-shot speech editing--including word-level insertion, deletion, and substitution--without any additional training. Theoretically, we prove that AR-pretrained weights are near-optimal for bidirectional masked prediction under the locality property of acoustic tokens, explaining this rapid convergence. This general method modifies only the attention mask and objective, applying seamlessly to any LLM-based AR TTS system. Code and audio samples will be available at https://deft-piroshki-b652b5.netlify.app/.
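The transfer recipe described in the abstract (keep the backbone, drop the causal attention mask, swap next-token prediction for masked-token prediction) can be summarized in a short training-step sketch. The following is a minimal PyTorch-style illustration, not the authors' code: the names MASK_ID and masked_diffusion_loss, the assumption that model(input_ids) already attends bidirectionally and returns per-token logits, and the 1/tau reweighting (standard in LLaDA-style masked diffusion) are all assumptions made for the example.

    import torch
    import torch.nn.functional as F

    MASK_ID = 6561  # hypothetical id of the [MASK] speech token (not from the paper)

    def masked_diffusion_loss(model, text_ids, speech_ids):
        """One fine-tuning step of a masked-diffusion objective (illustrative sketch).

        A masking ratio tau is drawn per example, that fraction of speech tokens is
        replaced by [MASK], the model attends bidirectionally over text + speech, and
        cross-entropy is computed only at masked positions, reweighted by 1/tau as in
        LLaDA-style masked diffusion. Text tokens are never masked.
        """
        B, N = speech_ids.shape
        tau = torch.rand(B, 1, device=speech_ids.device).clamp(min=0.05)   # masking ratio per example
        is_masked = torch.rand(B, N, device=speech_ids.device) < tau       # which speech tokens to hide
        noisy_speech = torch.where(is_masked, torch.full_like(speech_ids, MASK_ID), speech_ids)

        input_ids = torch.cat([text_ids, noisy_speech], dim=1)
        logits = model(input_ids)                        # (B, L, V); assumes bidirectional attention
        speech_logits = logits[:, text_ids.shape[1]:, :]

        ce = F.cross_entropy(speech_logits.transpose(1, 2), speech_ids, reduction="none")
        weighted = ce * is_masked.float() / tau          # loss only at masked positions, 1/tau weight
        return weighted.sum() / (B * N)

Because only the attention pattern and the loss change, a step of this shape could in principle be applied to any LLM-based AR TTS backbone, which is the portability claim made in the abstract.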

Paper Structure

This paper contains 35 sections, 2 theorems, 3 equations, 5 figures, 2 tables, 1 algorithm.

Key Result

Theorem 1

Under Assumption (ass:eps_forward), let $\tilde{X}$ be a partial masking of $X$ in which each token is independently revealed with probability $1{-}\tau$. For any masked position $i$, the expected KL divergence between the optimal bidirectional predictor and the left-context-only (AR) predictor satisfies:

Figures (5)

  • Figure 1: LLaDA-TTS architecture overview. A bidirectional Transformer (Qwen2) iteratively unmasks speech tokens in $T$ steps. The text encoder, sequence format, and downstream flow matching vocoder remain identical to the AR baseline.
  • Figure 2: Speech editing pipeline: (1) align text$\to$speech via attention, (2) mask affected region with context margins, (3) regenerate via iterative unmasking. The surrounding tokens remain frozen, providing bidirectional conditioning throughout.
  • Figure 3: Speed--quality tradeoff. Left axis: test-zh CER (%) vs. denoising steps; right axis: LLM-stage RTF on A100. The dashed green line marks the CosyVoice 3 AR baseline CER (1.21%). LLaDA-TTS surpasses the AR baseline at 48 steps (${\sim}2.6{\times}$ speedup) and achieves 0.74% CER at 96 steps.
  • Figure 4: Unmasking process for a Chinese utterance (64 steps). Each column is a speech token position; color indicates the step at which the token was unmasked (blue=early, red=late, gray=still masked). The generation exhibits a predominantly left-to-right progression with confidence-based deviations (see the sketch after this list).
  • Figure 5: Emergent text-to-speech alignment in LLaDA-TTS. Axes: text tokens ($\rightarrow$) vs. speech tokens ($\downarrow$).
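Figures 2 and 4 describe the same inference loop from two angles: synthesis starts from an all-masked speech sequence, while editing masks only the affected span and keeps the surrounding tokens frozen as bidirectional context. Below is a minimal, batch-size-1 sketch of one common confidence-based unmasking schedule; MASK_ID, iterative_unmask, and the linear per-step reveal budget are illustrative assumptions, and the paper's exact remasking schedule may differ.

    import torch

    MASK_ID = 6561  # same hypothetical [MASK] speech-token id as in the training sketch

    @torch.no_grad()
    def iterative_unmask(model, text_ids, speech_ids, editable, steps=64):
        """Fixed-step parallel decoding by confidence-based unmasking (batch-size-1 sketch).

        Synthesis: speech_ids starts as all MASK_ID and editable is all True.
        Editing (Fig. 2): only the regenerated span is MASK_ID / editable; the frozen
        surrounding tokens provide bidirectional conditioning.
        """
        speech_ids = speech_ids.clone()
        total = int(editable.sum())
        for step in range(steps):
            still_masked = (speech_ids == MASK_ID) & editable
            if not still_masked.any():
                break
            logits = model(torch.cat([text_ids, speech_ids], dim=1))
            probs = logits[:, text_ids.shape[1]:, :].softmax(-1)
            conf, pred = probs.max(-1)                   # per-position confidence and argmax token

            # Reveal enough positions to stay on a linear schedule of ~total/steps per step.
            target = round(total * (step + 1) / steps)
            revealed = int((editable & (speech_ids != MASK_ID)).sum())
            k = min(max(1, target - revealed), int(still_masked.sum()))

            conf = conf.masked_fill(~still_masked, float("-inf"))
            top = conf.view(-1).topk(k).indices          # most confident masked positions
            reveal = torch.zeros_like(still_masked).view(-1)
            reveal[top] = True
            reveal = reveal.view_as(still_masked)
            speech_ids[reveal] = pred[reveal]
        return speech_ids

For the word-level edits of Figure 2, editable would mark the text-aligned span plus its context margins, with those positions reset to MASK_ID before calling the loop; every token outside the span stays identical to the original utterance.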

Theorems & Definitions (2)

  • Theorem 1: Bounded Suboptimality of AR Initialization
  • Corollary 1: Emergence of Left-to-Right Unmasking