Pre-training Feature Guided Diffusion Model for Speech Enhancement

Yiyuan Yang; Niki Trigoni; Andrew Markham

TL;DR

This work introduces FUSE, a diffusion-based speech enhancement framework guided by pre-trained features. It combines a VAE-based latent feature extractor with guidance from frozen BEATs features during the reverse diffusion process, and uses DDIM sampling to reduce the number of sampling steps. The method achieves state-of-the-art results on two public datasets, is robust across SNRs, and remains practical to deploy without added parameters.

Abstract

Speech enhancement improves the clarity and intelligibility of speech in noisy environments, benefiting communication and listening experiences. In this paper, we introduce a novel pre-training feature guided diffusion model tailored for efficient speech enhancement, addressing the limitations of existing discriminative and generative models. By integrating spectral features into a variational autoencoder (VAE), leveraging pre-trained features for guidance during the reverse process, and using the denoising diffusion implicit model (DDIM) to reduce the number of sampling steps, our model improves both efficiency and enhancement quality. It achieves state-of-the-art results on two public datasets with different SNRs and outperforms other baselines in efficiency and robustness. The proposed method not only improves performance but also enhances practical deployment capabilities, without increasing computational demands.
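To make the role of DDIM in the abstract concrete, the following is a minimal sketch of deterministic DDIM sampling over a shortened timestep schedule. It is not the authors' implementation: the linear beta schedule, the function names `make_alpha_bars` and `ddim_sample`, the `eps_model(x, t, cond)` signature, and the `cond` argument (standing in for pre-trained feature guidance) are all assumptions made for illustration.

```python
import numpy as np

def make_alpha_bars(num_train_steps=1000, beta_start=1e-4, beta_end=0.02):
    # Assumed linear beta schedule; the paper's schedule may differ.
    betas = np.linspace(beta_start, beta_end, num_train_steps)
    return np.cumprod(1.0 - betas)

def ddim_sample(x_T, eps_model, alpha_bars, num_steps, cond=None):
    """Deterministic DDIM sampling (eta = 0) with num_steps <= len(alpha_bars).

    eps_model(x, t, cond) is a hypothetical noise predictor; cond is a
    placeholder for guidance features and is simply passed through.
    """
    T = len(alpha_bars)
    # Evenly spaced sub-sequence of timesteps, from most to least noisy.
    timesteps = np.linspace(T - 1, 0, num_steps).round().astype(int)
    x = x_T
    for i, t in enumerate(timesteps):
        a_t = alpha_bars[t]
        # After the last step the target noise level is zero (alpha_bar = 1).
        a_prev = alpha_bars[timesteps[i + 1]] if i + 1 < num_steps else 1.0
        eps = eps_model(x, t, cond)
        # Estimate the clean sample, then move it to the next noise level.
        x0_pred = (x - np.sqrt(1.0 - a_t) * eps) / np.sqrt(a_t)
        x = np.sqrt(a_prev) * x0_pred + np.sqrt(1.0 - a_prev) * eps
    return x
```

Because the update is deterministic, the number of reverse steps can be chosen at inference time independently of the number of training steps, which is the property the abstract relies on to reduce sampling cost.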

Paper Structure (17 sections, 7 equations, 2 figures, 1 table, 2 algorithms)

Introduction
Related work
Speech enhancement
Conditional diffusion models
Denoising diffusion implicit model (DDIM)
Proposed method
Latent feature extraction based on VAE
Pre-training feature guided diffusion model
Restore to clean speech
Experiment
Experimental setup
Result
Conclusion
Appendix
Datasets
...and 2 more sections

Figures (2)

Figure 1: The workflow of the proposed FUSE method. It consists of three processes: latent feature extraction based on VAE (Sec. 3.1), pre-training feature guided diffusion model (Sec. 3.2), and restore to clean speech (Sec. 3.3). Within these, we highlight the three components that improve efficiency, and mark which parts are trained and which are frozen during the training process.
Figure 2: Results on the WSJ0-CHiME3 dataset for training with different pre-trained features and with unconditional features, as a function of the number of reverse diffusion steps.
