Beyond Two-stage Diffusion TTS: Joint Structure and Content Refinement via Jump Diffusion

Jiabao Ai; Minghui Zhao; Anton Ragni

Abstract

Diffusion and flow-matching TTS faces a tension between discrete temporal structure and continuous spectral modeling. Two-stage models diffuse on fixed alignments and often collapse to mean prosody; single-stage models avoid explicit durations but suffer from alignment instability. We propose a jump-diffusion framework in which discrete jumps model temporal structure and continuous diffusion refines spectral content within a single process. Even in its degenerate one-shot form, our framework achieves 3.37% WER versus 4.38% for Grad-TTS, with improved UTMOSv2, on LJSpeech. The full iterative Upsample-Diffuse-Downsample (UDD) variant further enables adaptive prosody: on out-of-distribution slow speech, it autonomously inserts natural pauses rather than stretching the utterance uniformly. Audio samples are available at https://anonymousinterpseech.github.io/TTS_Demo/.

Paper Structure (14 sections, 9 equations, 5 figures, 2 tables, 2 algorithms)

Introduction
Background
Score-based Diffusion Probabilistic Modelling
Jump Processes in Generative Modeling
Method
Forward Jump Diffusion
Reverse Process (Inference)
Implementation Details
UDD and One-shot Variant
Experiments
Experimental Setup
Main Results and Comparisons
Adaptive Pauses in Slow Speech
Conclusion and Future Work

Figures (5)

Figure 1: Jump diffusion process for Mel-spectrograms.
Figure 2: Inference pipeline. Starting from a noisy and incomplete phone-level state, the model iteratively performs jumps and diffusion.
Figure 3: Upsample--Diffuse--Downsample (UDD).
Figure 4: DTW alignment paths at $0.75\times$ speed.
Figure 5: Example Mel-spectrograms at $0.75\times$ speed.
