DiffProsody: Diffusion-based Latent Prosody Generation for Expressive Speech Synthesis with Prosody Conditional Adversarial Training

Hyung-Seok Oh; Sang-Hoon Lee; Seong-Whan Lee

TL;DR

DiffProsody addresses the challenge of expressive, natural-sounding TTS with efficient inference by introducing a diffusion-based latent prosody generator (DLPG) and prosody conditional adversarial training. The DLPG, built on a DDGAN framework, reduces diffusion timesteps while a prosody conditional discriminator guides the TTS module to reflect accurate prosody; a vector quantization layer aids prosody disentanglement. Empirical results on VCTK show DiffProsody surpasses FastSpeech 2 and ProsoSpeech in MOS and several objective metrics, while also achieving faster prosody generation. The work demonstrates a practical path toward high-quality, expressive TTS with reduced latency, and outlines future improvements in vector quantization and language-model pretraining for further gains.

Abstract

Expressive text-to-speech systems have undergone significant advancements owing to prosody modeling, but conventional methods can still be improved. Traditional approaches have relied on the autoregressive method to predict the quantized prosody vector; however, it suffers from the issues of long-term dependency and slow inference. This study proposes a novel approach called DiffProsody in which expressive speech is synthesized using a diffusion-based latent prosody generator and prosody conditional adversarial training. Our findings confirm the effectiveness of our prosody generator in generating a prosody vector. Furthermore, our prosody conditional discriminator significantly improves the quality of the generated speech by accurately emulating prosody. We use denoising diffusion generative adversarial networks to improve the prosody generation speed. Consequently, DiffProsody is capable of generating prosody 16 times faster than the conventional diffusion model. The superior performance of our proposed method has been demonstrated via experiments.

Paper Structure (41 sections, 32 equations, 7 figures, 5 tables)

Introduction
Related works
Non-autoregressive text-to-speech
Generative adversarial networks
Denoising diffusion models
Prosody modeling
DiffProsody
Text-to-speech module
Prosody module
Prosody conditional adversarial training
Diffusion-based latent prosody generator
Inference
Experimental result and discussion
Experimental setup
Implementation details
...and 26 more sections

Figures (7)

Figure 1: Framework of DiffProsody. (a) Overall architecture including TTS and prosody modeling with prosody conditional adversarial training; (b) Prosody modeling by vector quantization with prosody encoder and diffusion-based latent prosody generator; (c) Text encoder that models text at the phoneme-level and word-level; (d) Prosody encoder that models the word-level target prosody; (e) Prosody conditional discriminator for adversarial training. DP represents a duration predictor, and LR represents a length regulator. In the first stage, the TTS and prosody encoder are trained jointly, and in the second stage, a diffusion-based latent prosody generator (DLPG) is trained using the output of the pre-trained prosody encoder as a target. In inference, the TTS module synthesizes speech conditioned on the prosody vector generated by DLPG.
Figure 2: Training the diffusion-based latent prosody generator. We adopt the design of DDGANs [xiao2022tackling] to reduce the number of diffusion timesteps. The generator $G_\theta$ takes the speaker hidden representation $\mathbf{h}_{spk}$, the text hidden representation $\mathbf{h}_{txt}$, the timestep $t$, and the noisy data $\mathbf{x}_{t}$ as input and generates $\mathbf{x}_{0}'$. The discriminator $D_\phi$ then determines which of two candidates is compatible with $\mathbf{x}_{t}$ at timestep $t$: $\mathbf{x}_{t-1}'$, obtained by posterior sampling from $\mathbf{x}_{0}'$, or $\mathbf{x}_{t-1}$, obtained by applying the forward process to $\mathbf{x}_{0}$.
Figure 3: Comparison of the visualized spectrogram and pitch contour. The red box indicates that the proposed model is more similar to the GT.
Figure 4: Histogram visualization of log f0, where the blue bars represent the GT distribution and orange bars represent the generated distribution. The distribution of the proposed model overlaps to a greater extent with the GT distribution than the other comparison models.
Figure 6: Comparison of objective evaluation results across diffusion timesteps when using the DDPM and DDGAN frameworks in DLPG. The blue line shows the results for DDGAN and the red line shows the results for DDPM.
...and 2 more figures
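
The training step described in the Figure 2 caption can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses the standard DDPM forward process and Gaussian posterior, and the schedule values, function names, and the `generator` callable (a stand-in for $G_\theta$) are hypothetical. It only shows how the "real" pair $(\mathbf{x}_{t-1}, \mathbf{x}_{t})$ and the "fake" pair $(\mathbf{x}_{t-1}', \mathbf{x}_{t})$ shown to the discriminator might be built.

```python
import numpy as np

def make_schedule(num_steps, beta_min=0.1, beta_max=0.5):
    """Linear beta schedule. The values are illustrative assumptions only."""
    betas = np.linspace(beta_min, beta_max, num_steps)   # betas[t-1] is beta_t
    alphas = 1.0 - betas
    # alpha_bars[0] = 1 corresponds to clean data x_0; alpha_bars[t] covers steps 1..t.
    alpha_bars = np.concatenate([[1.0], np.cumprod(alphas)])
    return betas, alphas, alpha_bars

def q_sample(x0, t, alpha_bars, rng):
    """Forward process: sample x_t ~ q(x_t | x_0) in closed form."""
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * noise

def q_step(x_prev, t, betas, rng):
    """One forward step: sample x_t ~ q(x_t | x_{t-1})."""
    beta_t = betas[t - 1]
    noise = rng.standard_normal(x_prev.shape)
    return np.sqrt(1.0 - beta_t) * x_prev + np.sqrt(beta_t) * noise

def posterior_sample(x0_pred, x_t, t, betas, alphas, alpha_bars, rng):
    """Posterior sampling: x_{t-1}' ~ q(x_{t-1} | x_t, x_0')."""
    beta_t, alpha_t = betas[t - 1], alphas[t - 1]
    ab_t, ab_prev = alpha_bars[t], alpha_bars[t - 1]
    mean = (np.sqrt(ab_prev) * beta_t / (1.0 - ab_t)) * x0_pred \
         + (np.sqrt(alpha_t) * (1.0 - ab_prev) / (1.0 - ab_t)) * x_t
    var = (1.0 - ab_prev) / (1.0 - ab_t) * beta_t
    return mean + np.sqrt(var) * rng.standard_normal(x_t.shape)

def make_discriminator_pairs(x0, generator, h_spk, h_txt, t, schedule, rng):
    """Build the real and fake (x_{t-1}, x_t) pairs for timestep t >= 1."""
    betas, alphas, alpha_bars = schedule
    x_prev_real = q_sample(x0, t - 1, alpha_bars, rng)   # x_{t-1} from the forward process
    x_t = q_step(x_prev_real, t, betas, rng)             # noisy input at timestep t
    x0_pred = generator(x_t, t, h_spk, h_txt)            # x_0' predicted by the generator
    x_prev_fake = posterior_sample(x0_pred, x_t, t, betas, alphas, alpha_bars, rng)
    return (x_prev_real, x_t), (x_prev_fake, x_t)
```

In an adversarial setup of this kind, the discriminator would be trained to score the real pair higher than the fake pair at each timestep, and the generator to make the fake pair indistinguishable. Passing any callable with the signature `generator(x_t, t, h_spk, h_txt)` is enough to exercise the sketch.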
