Beyond Two-stage Diffusion TTS: Joint Structure and Content Refinement via Jump Diffusion

Jiabao Ai, Minghui Zhao, Anton Ragni

Abstract

Diffusion and flow-matching TTS faces a tension between discrete temporal structure and continuous spectral modeling. Two-stage models diffuse on fixed alignments, often collapsing to mean prosody; single-stage models avoid explicit durations but suffer from alignment instability. We propose a jump-diffusion framework in which discrete jumps model temporal structure and continuous diffusion refines spectral content within one process. Even in its one-shot degenerate form, our framework achieves 3.37% WER vs. 4.38% for Grad-TTS, with improved UTMOSv2, on LJSpeech. The full iterative UDD variant further enables adaptive prosody, autonomously inserting natural pauses in out-of-distribution slow speech rather than stretching uniformly. Audio samples are available at https://anonymousinterpseech.github.io/TTS_Demo/.

Paper Structure (14 sections, 9 equations, 5 figures, 2 tables, 2 algorithms)

Figures (5)

  • Figure 1: Jump diffusion process for Mel-spectrograms.
  • Figure 2: Inference pipeline. Starting from a noisy and incomplete phone-level state, the model iteratively performs jumps and diffusion.
  • Figure 3: Upsample–Diffuse–Downsample (UDD).
  • Figure 4: DTW alignment paths at $0.75\times$ speed.
  • Figure 5: Example Mel-spectrograms at $0.75\times$ speed.