GDiffuSE: Diffusion-based speech enhancement with noise model guidance
Efrayim Yanir, David Burshtein, Sharon Gannot
TL;DR
GDiffuSE tackles robust speech enhancement under unseen noise by using a lightweight noise distribution model to guide a foundation diffusion model trained on clean speech. The method trains the noise model to capture the conditional noise distribution and integrates its gradient into the DDPM reverse process via a time-varying guidance factor, enabling sampling from $p(\mathbf{x}_0|\mathbf{y})$ without retraining the large backbone. Empirical results on BBC noise show improved PESQ and SI-SDR over a strong generative baseline, with notable gains for high-frequency unseen noise, while maintaining perceptual quality. The approach reduces the data and retraining burden of robust SE, making diffusion-based speech generation techniques more practical in real-world noisy environments.
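The guidance mechanism summarized above can be illustrated with a minimal NumPy sketch of one guided DDPM reverse step. It is an illustration under stated assumptions, not the authors' implementation: the additive model $\mathbf{y} = \mathbf{x}_0 + \mathbf{n}$, the linear variance schedule, the Gaussian stand-in for the learned noise model, the placeholder denoiser, and all function names (`make_schedule`, `gaussian_noise_score`, `guided_reverse_step`) are hypothetical. The likelihood gradient is also applied directly to $\mathbf{x}_t$, ignoring the denoiser Jacobian, which is a simplification.

```python
import numpy as np

def make_schedule(num_steps=50, beta_start=1e-4, beta_end=0.02):
    """Linear DDPM variance schedule (a common default, assumed here)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars

def gaussian_noise_score(n, sigma=1.0):
    """Toy stand-in for a learned noise model: grad_n log N(n; 0, sigma^2 I)."""
    return -n / sigma**2

def guided_reverse_step(x_t, t, y, eps_model, noise_score, guidance, schedule, rng):
    """One reverse step whose mean is shifted by the noise-model gradient."""
    betas, alphas, alpha_bars = schedule
    eps_hat = eps_model(x_t, t)
    # Unguided DDPM posterior mean.
    mean = (x_t - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
    # Clean-speech estimate implied by the predicted diffusion noise.
    x0_hat = (x_t - np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alpha_bars[t])
    # With y = x0 + n: grad_{x0} log p(y | x0) = -grad_n log p_n(n) at n = y - x0.
    likelihood_grad = -noise_score(y - x0_hat)
    # Time-varying guidance factor scales the shift; betas[t] is the step variance.
    mean = mean + guidance(t) * betas[t] * likelihood_grad
    if t == 0:
        return mean  # no noise is injected at the final step
    return mean + np.sqrt(betas[t]) * rng.standard_normal(x_t.shape)

# Toy run: a zero "denoiser" stands in for the pretrained clean-speech backbone.
schedule = make_schedule()
rng = np.random.default_rng(0)
y = np.ones(4)
x = rng.standard_normal(4)
zero_eps = lambda x_t, t: np.zeros_like(x_t)
for t in reversed(range(len(schedule[0]))):
    x = guided_reverse_step(x, t, y, zero_eps, gaussian_noise_score,
                            lambda t: 1.0, schedule, rng)
```

In this sketch only `noise_score` depends on the noise type, so swapping in a different noise model leaves `eps_model` untouched, which mirrors the claim that the large backbone need not be retrained.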
Abstract
This paper introduces a novel speech enhancement (SE) approach based on a denoising diffusion probabilistic model (DDPM), termed guided diffusion for speech enhancement (GDiffuSE). In contrast to conventional methods that directly map noisy speech to clean speech, our method employs a lightweight helper model to estimate the noise distribution, which is then incorporated into the diffusion denoising process via a guidance mechanism. This design improves robustness by enabling seamless adaptation to unseen noise types and by allowing large-scale DDPMs originally trained for speech generation to be reused for SE. We evaluate our approach on noisy signals obtained by adding noise samples from the BBC sound effects database to LibriSpeech utterances, showing consistent improvements over state-of-the-art baselines under mismatched noise conditions. Examples are available at our project webpage.
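The evaluation setup described above, additive noise mixed into clean utterances and scored with SI-SDR as mentioned in the TL;DR, can be sketched as follows. The target-SNR mixing rule and the function names `mix_at_snr` and `si_sdr` are assumptions made for illustration; the text here does not state the SNR levels or the exact mixing procedure used. The SI-SDR function follows the standard scale-invariant definition.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so the clean-to-noise power ratio equals `snr_db`, then add it.

    Assumes `noise` is at least as long as `clean`; the excess is truncated.
    """
    noise = noise[: len(clean)]
    clean_power = np.mean(clean**2)
    noise_power = np.mean(noise**2)
    gain = np.sqrt(clean_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return clean + gain * noise

def si_sdr(estimate, reference):
    """Scale-invariant signal-to-distortion ratio in dB."""
    reference = reference - reference.mean()
    estimate = estimate - estimate.mean()
    # Project the estimate onto the reference to find the optimally scaled target.
    scale = np.dot(estimate, reference) / np.dot(reference, reference)
    target = scale * reference
    residual = estimate - target
    return 10.0 * np.log10(np.sum(target**2) / np.sum(residual**2))

# Toy usage with synthetic signals standing in for speech and noise recordings.
rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)
noise = rng.standard_normal(20000)
noisy = mix_at_snr(clean, noise, snr_db=5.0)
score_db = si_sdr(noisy, clean)
```

Because SI-SDR rescales the reference before measuring the residual, multiplying the estimate by a constant gain leaves the score unchanged, which makes it insensitive to output level differences between enhancement systems.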
