Discrete-Time Diffusion-Like Models for Speech Synthesis

Xiaozhou Tan, Minghui Zhao, Anton Ragni

TL;DR

The paper addresses the training/inference mismatch in diffusion-based speech synthesis by proposing fully discrete-time diffusion-like models with four noise types: additive Gaussian, multiplicative Gaussian, blurring, and a mixture of blurring and Gaussian noise. The models are trained to predict clean data within a consistent discrete framework, and two inference schemes are introduced for iterative refinement. Experimental results on LJ Speech show that the discrete-time variants achieve objective metrics comparable to a continuous baseline, with some noise types offering improvements in log f0 and perceptual quality, while others trade off randomness for stability. The work suggests that discrete-time diffusion-like models can deliver efficient, consistent training and competitive speech quality, broadening the noise-design space for TTS.

Abstract

Diffusion models have attracted a lot of attention in recent years. These models view speech generation as a continuous-time process. For efficient training, this process is typically restricted to additive Gaussian noising, which is limiting. For inference, the time is typically discretized, leading to a mismatch between continuous training and discrete sampling conditions. Recently proposed discrete-time processes, on the other hand, usually do not have these limitations, may require substantially fewer inference steps, and are fully consistent between training and inference conditions. This paper explores some diffusion-like discrete-time processes and proposes some new variants. These include processes applying additive Gaussian noise, multiplicative Gaussian noise, blurring noise, and a mixture of blurring and Gaussian noise. The experimental results suggest that discrete-time processes offer subjective and objective speech quality comparable to their widely popular continuous counterpart, with more efficient and consistent training and inference schemes.
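
To make the four corruption types named above concrete, the following is a minimal illustrative sketch of how each might degrade a one-dimensional feature sequence at a discrete step t out of T. It is not the paper's formulation: the linear noise schedule, the moving-average blur, the `max_sigma` parameter, and all function names are assumptions chosen for illustration only.

```python
import random


def additive_gaussian(x0, t, T, rng, max_sigma=1.0):
    """x_t = x_0 + sigma_t * eps, with an assumed linear schedule sigma_t = max_sigma * t / T."""
    sigma = max_sigma * t / T
    return [v + sigma * rng.gauss(0.0, 1.0) for v in x0]


def multiplicative_gaussian(x0, t, T, rng, max_sigma=1.0):
    """x_t = x_0 * (1 + sigma_t * eps): the noise scales with the signal value."""
    sigma = max_sigma * t / T
    return [v * (1.0 + sigma * rng.gauss(0.0, 1.0)) for v in x0]


def blur(x0, t):
    """Deterministic blurring: moving average with half-width t (assumed kernel).

    The window is truncated at the sequence edges, and t = 0 returns the input values.
    """
    n = len(x0)
    out = []
    for i in range(n):
        lo, hi = max(0, i - t), min(n, i + t + 1)
        out.append(sum(x0[lo:hi]) / (hi - lo))
    return out


def blur_plus_gaussian(x0, t, T, rng, max_sigma=1.0):
    """Mixture: blur first, then add Gaussian noise at the same step."""
    return additive_gaussian(blur(x0, t), t, T, rng, max_sigma)


if __name__ == "__main__":
    rng = random.Random(0)          # fixed seed for repeatability
    clean = [0.0, 0.0, 3.0, 0.0, 0.0]
    T = 4
    for t in range(T + 1):
        print(t, blur(clean, t), blur_plus_gaussian(clean, t, T, rng))
```

In this sketch every process returns the clean input at t = 0 and becomes progressively more degraded as t grows, so a model trained to predict the clean sequence from any step sees exactly the same discrete steps at training and inference time.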

Paper Structure

This paper contains 16 sections, 10 equations, 1 figure, 3 tables, 2 algorithms.

Figures (1)

  • Figure 1: Detailed breakdown of MOS score counts