Speech Enhancement and Dereverberation with Diffusion-based Generative Models

Julius Richter, Simon Welker, Jean-Marie Lemercier, Bunlong Lay, Timo Gerkmann

TL;DR

This paper advances diffusion-based speech enhancement by introducing a forward SDE with a drift term that operates in the complex STFT domain, and by training a score-based model (SGMSE) with denoising score matching. A redesigned NCSN++-based architecture estimates the score of complex spectrograms, which makes 30-step sampling sufficient and extends the method to single-channel dereverberation as well as speech enhancement. In matched and mismatched (cross-dataset) conditions and on real-world DNS data, the resulting SGMSE+ is competitive with discriminative baselines, generalizes better to a corpus not seen in training, and is rated best in a listening test, with non-intrusive metrics supporting the result. The work also compares sampler configurations to trade off quality against computation, and points to further conditioning and faster inference as future directions.

Abstract

In this work, we build upon our previous publication and use diffusion-based generative models for speech enhancement. We present a detailed overview of the diffusion process, which is based on a stochastic differential equation, and examine its theoretical implications in depth. In contrast to usual conditional generation tasks, we do not start the reverse process from pure Gaussian noise but from a mixture of noisy speech and Gaussian noise. This matches our forward process, which moves from clean speech to noisy speech by including a drift term. We show that this procedure enables using only 30 diffusion steps to generate high-quality clean speech estimates. By adapting the network architecture, we are able to significantly improve the speech enhancement performance, indicating that the network, rather than the formalism, was the main limitation of our original approach. In an extensive cross-dataset evaluation, we show that the improved method can compete with recent discriminative models and achieves better generalization when evaluated on a corpus different from the one used for training. We complement the results with an instrumental evaluation using real-world noisy recordings and a listening experiment, in which our proposed method is rated best. Examining different sampler configurations for solving the reverse process allows us to balance the performance and computational speed of the proposed method. Moreover, we show that the proposed method is also suitable for dereverberation and thus not limited to additive background noise removal. Code and audio examples are available online, see https://github.com/sp-uhh/sgmse.
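
The idea of a forward process that drifts from clean toward noisy speech, and a reverse process that starts from noisy speech plus Gaussian noise, can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function names, the exponential interpolation of the mean, the simple exponential noise schedule, and the values of `gamma`, `sigma_min`, and `sigma_max` are all assumptions chosen for readability.

```python
import numpy as np


def complex_gaussian(shape, rng):
    """Circularly symmetric complex standard normal noise (unit variance)."""
    return (rng.standard_normal(shape) + 1j * rng.standard_normal(shape)) / np.sqrt(2)


def forward_mean(x0, y, t, gamma=1.5):
    """Assumed mean of the forward process: equals x0 at t=0 and drifts toward y."""
    w = np.exp(-gamma * t)
    return w * x0 + (1.0 - w) * y


def noise_std(t, sigma_min=0.05, sigma_max=0.5):
    """Hypothetical exponentially increasing noise level (illustrative schedule only)."""
    return sigma_min * (sigma_max / sigma_min) ** t


def sample_forward(x0, y, t, rng):
    """Draw a perturbed spectrogram at time t: drifted mean plus Gaussian noise."""
    return forward_mean(x0, y, t) + noise_std(t) * complex_gaussian(x0.shape, rng)


def reverse_start(y, T, rng):
    """Initial state of the reverse process: noisy speech plus Gaussian noise,
    rather than pure Gaussian noise."""
    return y + noise_std(T) * complex_gaussian(y.shape, rng)


# Toy usage with random arrays standing in for complex STFT spectrograms.
rng = np.random.default_rng(0)
clean = complex_gaussian((4, 8), rng)                 # stand-in for clean speech x0
noisy = clean + 0.3 * complex_gaussian((4, 8), rng)   # stand-in for noisy speech y
x_t = sample_forward(clean, noisy, t=0.5, rng=rng)    # intermediate forward state
x_T = reverse_start(noisy, T=1.0, rng=rng)            # where reverse sampling begins
```

In this sketch the reverse process would begin at `x_T`, which is centered on the noisy input, so the sampler only has to remove the environmental noise and the added Gaussian noise instead of generating speech from scratch.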

Paper Structure (46 sections, 14 equations, 6 figures, 5 tables)

Figures (6)

  • Figure 1: Diffusion process on a spectrogram: In the forward process noise is gradually added to the clean speech spectrogram $\mathbf x_0$, while the reverse process learns to generate clean speech in an iterative fashion starting from the corrupted signal $\mathbf x_T$.
  • Figure 2: (Left) The forward and reverse process illustrated with a single scalar variable. The mean $\boldsymbol \mu$ (thick black line) of the forward process exponentially decays from clean speech $\mathbf x_0$ (blue) towards noisy speech $\mathbf y$ (green), and the standard deviation (shaded gray region) increases exponentially. The reverse process moves back to $\mathbf x_0$, starting from a slightly mismatched distribution $\tilde{p}_T$ which is centered around $\mathbf y$ rather than $\mathbf x_T$. Sample paths from both processes are shown as thin black lines. (Right) Time evolution of the SNR of the mean $\boldsymbol \mu$ (black) with respect to the SNR of $\mathbf y$ (green) for three different values of $\gamma$.
  • Figure 3: NCSN++ network architecture used as a score model $\mathbf s_\theta$: The architecture is based on a multi-resolution U-Net structure containing skip connections and an additional progressive growing path as shown in (a). Each up- and downsampling layer and the bottleneck layer consist of multiple residual blocks in series which are illustrated in (b).
  • Figure 4: Model performance in PESQ and SI-SDR as a function of (a) the number of reverse steps $N$ and (b) the step size parameter $r$ in the annealed Langevin corrector.
  • Figure 5: Violin plots showing POLQA results for the matched and the mismatched condition with dashed and dotted lines representing median and quartiles, respectively.
  • ...and 1 more figures
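
As a worked reading of the right panel of Figure 2, the relation between the SNR of the mean and the SNR of $\mathbf y$ can be derived under two assumptions that are not stated in the caption itself: the noise is additive, $\mathbf y = \mathbf x_0 + \mathbf n$, and the exponentially decaying mean has the interpolation form $\boldsymbol \mu(t) = e^{-\gamma t}\mathbf x_0 + (1 - e^{-\gamma t})\mathbf y$.

```latex
\begin{aligned}
\boldsymbol{\mu}(t) &= e^{-\gamma t}\,\mathbf{x}_0 + \left(1 - e^{-\gamma t}\right)\left(\mathbf{x}_0 + \mathbf{n}\right)
  = \mathbf{x}_0 + \left(1 - e^{-\gamma t}\right)\mathbf{n}, \\
\mathrm{SNR}_{\boldsymbol{\mu}}(t) &= 10 \log_{10} \frac{\lVert \mathbf{x}_0 \rVert^2}{\left(1 - e^{-\gamma t}\right)^2 \lVert \mathbf{n} \rVert^2}
  = \mathrm{SNR}_{\mathbf{y}} - 20 \log_{10}\!\left(1 - e^{-\gamma t}\right).
\end{aligned}
```

Under these assumptions the SNR of the mean is unbounded at $t = 0$ and approaches the SNR of $\mathbf y$ from above as $t$ grows, and a larger $\gamma$ closes the gap faster. For an illustrative value of $\gamma t = 1.5$, the gap is $-20 \log_{10}(1 - e^{-1.5}) \approx 2.2$ dB.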