GDiffuSE: Diffusion-based speech enhancement with noise model guidance

Efrayim Yanir, David Burshtein, Sharon Gannot

TL;DR

GDiffuSE tackles robust speech enhancement under unseen noise by guiding a foundation diffusion model trained on clean speech with a lightweight noise distribution model. The method trains a noise model to capture the conditional noise distribution and integrates its gradient into the DDPM reverse process via a time-varying guidance factor, enabling sampling from $p(\mathbf{x}_0|\mathbf{y})$ without retraining the large backbone. Empirical results on BBC noise demonstrate improved PESQ and SI-SDR compared to a strong generative baseline, with notable gains for high-frequency unseen noise, while maintaining perceptual quality. The approach reduces the data and retraining burden for robust SE, making diffusion-based speech generation techniques more practical for real-world noisy environments.
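One way to read the guidance step is through the standard classifier-guidance decomposition. The following is a sketch under that assumption, not the paper's own equations, which this summary does not reproduce. The symbol $\lambda_t$ for the time-varying guidance factor and $\sigma_t^2$ for the reverse-step variance are notation introduced here. By Bayes' rule, the posterior score splits into a clean-speech term, supplied by the pretrained diffusion model, and a likelihood term, supplied by the noise model:

$$
\nabla_{\mathbf{x}_t} \log p(\mathbf{x}_t \mid \mathbf{y}) = \nabla_{\mathbf{x}_t} \log p(\mathbf{x}_t) + \nabla_{\mathbf{x}_t} \log p(\mathbf{y} \mid \mathbf{x}_t)
$$

Under this reading, each reverse step shifts the unguided DDPM mean $\boldsymbol{\mu}_\theta(\mathbf{x}_t, t)$ along the likelihood gradient:

$$
\tilde{\boldsymbol{\mu}}_t = \boldsymbol{\mu}_\theta(\mathbf{x}_t, t) + \lambda_t \, \sigma_t^2 \, \nabla_{\mathbf{x}_t} \log p(\mathbf{y} \mid \mathbf{x}_t)
$$

Because only the second term depends on the noise type, adapting to a new noise only requires refitting the lightweight noise model.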

Abstract

This paper introduces a novel speech enhancement (SE) approach based on a denoising diffusion probabilistic model (DDPM), termed Guided diffusion for speech enhancement (GDiffuSE). In contrast to conventional methods that directly map noisy speech to clean speech, our method employs a lightweight helper model to estimate the noise distribution, which is then incorporated into the diffusion denoising process via a guidance mechanism. This design improves robustness by enabling seamless adaptation to unseen noise types and by leveraging large-scale DDPMs originally trained for speech generation in the context of SE. We evaluate our approach on noisy signals obtained by adding noise samples from the BBC sound effects database to LibriSpeech utterances, showing consistent improvements over state-of-the-art baselines under mismatched noise conditions. Examples are available at our project webpage.
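The guidance mechanism described above can be illustrated with a toy guided DDPM sampler. This is a minimal sketch, not the authors' implementation, and every name in it is hypothetical: `toy_eps_model` stands in for the large pretrained backbone (it is the exact noise predictor only for a unit-Gaussian "speech" prior), a fixed Gaussian with standard deviation `sigma_w` stands in for the learned noise model, `scale_fn` plays the role of the time-varying guidance factor, and the gradient is taken with respect to the clean estimate rather than through the denoiser.

```python
import numpy as np

def make_schedule(T=50, beta_min=1e-4, beta_max=0.05):
    """Linear beta schedule (an assumed choice) and its cumulative products."""
    betas = np.linspace(beta_min, beta_max, T)
    alphas = 1.0 - betas
    return betas, alphas, np.cumprod(alphas)

def toy_eps_model(x_t, alpha_bar_t):
    """Stand-in for the pretrained denoiser: exact when x_0 ~ N(0, I)."""
    return np.sqrt(1.0 - alpha_bar_t) * x_t

def noise_log_lik_grad(w_hat, sigma_w):
    """Gradient of log N(w_hat; 0, sigma_w^2 I) w.r.t. the clean estimate,
    where w_hat = y - x0_hat (additive-noise assumption)."""
    return w_hat / sigma_w**2

def guided_reverse_step(x_t, y, t, sched, sigma_w, scale, rng):
    betas, alphas, alpha_bars = sched
    eps = toy_eps_model(x_t, alpha_bars[t])
    # Clean-signal estimate implied by the predicted diffusion noise.
    x0_hat = (x_t - np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alpha_bars[t])
    # Noise estimate y - x0_hat is scored by the (toy) noise model.
    grad = noise_log_lik_grad(y - x0_hat, sigma_w)
    # Unguided DDPM posterior mean, with variance sigma_t^2 = beta_t.
    mean = (x_t - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
    var = betas[t]
    # Guidance: shift the mean along the noise-model gradient.
    mean = mean + scale * var * grad
    # No fresh noise is added on the final step (t == 0).
    z = rng.standard_normal(x_t.shape) if t > 0 else np.zeros_like(x_t)
    return mean + np.sqrt(var) * z

def enhance(y, sigma_w=0.5, T=50, scale_fn=lambda t, T: 1.0, seed=0):
    """Run T guided reverse steps starting from white noise."""
    rng = np.random.default_rng(seed)
    sched = make_schedule(T)
    x = rng.standard_normal(y.shape)
    for t in reversed(range(T)):
        x = guided_reverse_step(x, y, t, sched, sigma_w, scale_fn(t, T), rng)
    return x
```

Calling `enhance(y)` on a noisy vector `y` returns a sample of the same shape. Setting `scale_fn` to return `0.0` recovers the unguided sampler, which makes it easy to see how much of the result comes from the guidance term.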

Paper Structure

This paper contains 12 sections, 19 equations, 2 figures, 2 tables, 2 algorithms.

Figures (2)

  • Figure 1: GDiffuSE: The trained noise model guides the diffusion model for SE. Training stage: A noise sample $\bar{{\bf w}} \in \mathbb{R}^N$ is used to train the noise model $\boldsymbol{\phi}_t$ for each $t$. Inference stage: Starting from ${\bf x}_t$ (white noise for $t=T$), the diffusion process, guided by the loss from $\boldsymbol{\phi}_t$, generates ${\bf x}_{t-1}$; the clean estimate is ${\bf x}_0$. The input to $\boldsymbol{\phi}_t$ is the noise estimate, which is computed from ${\bf y}$. This step is repeated $T$ times (see the training and inference algorithms).
  • Figure 2: Spectrogram assessment for sample NHU05093027 (monsoon forest) drawn from the BBC sound effects dataset.