A Survey on Audio Diffusion Models: Text To Speech Synthesis and Enhancement in Generative AI

Chenshuang Zhang, Chaoning Zhang, Sheng Zheng, Mengchun Zhang, Maryam Qamar, Sung-Ho Bae, In So Kweon

TL;DR

Diffusion models offer a powerful framework for audio, particularly in text-to-speech synthesis and speech enhancement. The paper reviews a broad taxonomy: diffusion-based acoustic models and vocoders for TTS, end-to-end diffusion approaches, and diffusion-driven audio restoration and super-resolution. It highlights pioneering works (e.g., Diff-TTS, WaveGrad, DiffWave) and efficiency-driven advances (ProDiff, DiffGAN-TTS, ILVR, BDDM, InferGrad, NU-Wave variants), and also covers discriminative vs generative enhancement, unsupervised restoration, and universal task frameworks. By organizing methods across representation, architecture, and objective (removal vs addition, single- vs multi-task), the survey provides a practical roadmap for researchers and practitioners seeking to apply diffusion models to speech tasks and to improve their robustness and efficiency. For practitioners, the survey's comparison of end-to-end and two-stage pipelines, its coverage of multi-speaker and style control, and its review of fast inference techniques offer guidance for deploying diffusion-based speech systems; it also identifies open challenges in generalization, real-time performance, and cross-domain robustness.

Abstract

Generative AI has demonstrated impressive performance in various fields, among which speech synthesis is an interesting direction. With the diffusion model as the most popular generative model, numerous works have attempted two active tasks: text to speech and speech enhancement. This work conducts a survey on audio diffusion models, which is complementary to existing surveys that either lack the recent progress of diffusion-based speech synthesis or highlight an overall picture of applying diffusion models in multiple fields. Specifically, this work first briefly introduces the background of audio and diffusion models. As for the text-to-speech task, we divide the methods into three categories based on the stage where the diffusion model is adopted: acoustic model, vocoder, and end-to-end framework. Moreover, we categorize various speech enhancement tasks by whether certain signals are removed from or added to the input speech. Comparisons of experimental results and discussions are also covered in this survey.

Paper Structure (26 sections, 2 equations, 1 figure, 5 tables)

This paper contains 26 sections, 2 equations, 1 figure, 5 tables.

Figures (1)

  • Figure 1: Development of text-to-speech frameworks. (a) Three-stage framework; (b) one branch of the two-stage framework, which generates the waveform directly from linguistic features; (c) the other branch of the two-stage framework, which generates acoustic features directly from text.