
Pre-training Feature Guided Diffusion Model for Speech Enhancement

Yiyuan Yang, Niki Trigoni, Andrew Markham

TL;DR

This work tackles robust and efficient speech enhancement by introducing FUSE, a diffusion-based framework guided by pretraining features. It combines a VAE-based latent feature extractor with frozen BEATs-derived guidance during the diffusion reverse process and employs DDIM to dramatically reduce sampling steps. The method achieves state-of-the-art results on two public datasets and demonstrates strong cross-SNR robustness and deployment practicality without added parameters. Overall, FUSE offers a practical, high-quality diffusion-based solution for real-world speech enhancement tasks.

Abstract

Speech enhancement improves the clarity and intelligibility of speech in noisy environments, benefiting communication and listening experiences. In this paper, we introduce a novel pre-training feature-guided diffusion model tailored for efficient speech enhancement, addressing the limitations of existing discriminative and generative models. By integrating spectral features into a variational autoencoder (VAE), leveraging pre-trained features for guidance during the reverse process, and using denoising diffusion implicit model (DDIM) sampling to reduce the number of sampling steps, our model improves both efficiency and speech enhancement quality. It achieves state-of-the-art results on two public datasets with different SNRs and outperforms other baselines in efficiency and robustness. The proposed method not only improves performance but also enhances practical deployment capabilities, without increasing computational demands.
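
To illustrate how DDIM shortens the reverse process, here is a minimal NumPy sketch of deterministic DDIM sampling over a strided subset of timesteps. It is not the FUSE implementation: the linear noise schedule, the function names (`make_alpha_bar`, `ddim_sample`), and the `oracle_eps` noise predictor are all assumptions made for illustration, and the VAE latent extraction and BEATs guidance described above are left out.

```python
import numpy as np

def make_alpha_bar(num_train_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal level for an assumed linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, num_train_steps)
    return np.cumprod(1.0 - betas)

def ddim_sample(x_T, eps_model, alpha_bar, num_sample_steps):
    """Deterministic DDIM (eta = 0) over a strided subset of timesteps."""
    T = len(alpha_bar)
    # Visit only num_sample_steps of the T training timesteps, descending.
    timesteps = np.linspace(T - 1, 0, num_sample_steps).round().astype(int)
    x = x_T
    for i, t in enumerate(timesteps):
        a_t = alpha_bar[t]
        # After the last visited step, jump to the noise-free level (alpha_bar = 1).
        a_prev = alpha_bar[timesteps[i + 1]] if i + 1 < len(timesteps) else 1.0
        eps = eps_model(x, t)
        # Estimate the clean signal, then re-noise it to the next (lower) level.
        x0_pred = (x - np.sqrt(1.0 - a_t) * eps) / np.sqrt(a_t)
        x = np.sqrt(a_prev) * x0_pred + np.sqrt(1.0 - a_prev) * eps
    return x

# Toy check with a hypothetical oracle that knows the clean signal x0.
# A real system would use a trained network in place of oracle_eps.
rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()
x0 = rng.standard_normal(16)      # stand-in for a clean latent
noise = rng.standard_normal(16)
x_T = np.sqrt(alpha_bar[-1]) * x0 + np.sqrt(1.0 - alpha_bar[-1]) * noise

def oracle_eps(x, t):
    # Exact noise implied by x at level t, given the known x0.
    return (x - np.sqrt(alpha_bar[t]) * x0) / np.sqrt(1.0 - alpha_bar[t])

enhanced = ddim_sample(x_T, oracle_eps, alpha_bar, num_sample_steps=10)
```

With the oracle predictor, 10 sampling steps recover `x0` up to floating-point error even though the schedule has 1000 training steps. With a learned predictor the result is only approximate, and the number of steps trades quality against speed.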

Paper Structure

This paper contains 17 sections, 7 equations, 2 figures, 1 table, and 2 algorithms.

Figures (2)

  • Figure 1: The workflow of the proposed FUSE method. It consists of three processes: latent feature extraction based on a VAE (Sec. 3.1), the pre-training feature guided diffusion model (Sec. 3.2), and restoration to clean speech (Sec. 3.3). Within these, we highlight three components that can enhance efficiency, as well as the parts that are trained or frozen during the training process.
  • Figure 2: Results on the WSJ0-CHiME3 dataset for training with different pre-trained features and with unconditional features, as a function of the number of reverse diffusion steps.