StoRM: A Diffusion-based Stochastic Regeneration Model for Speech Enhancement and Dereverberation

Jean-Marie Lemercier, Julius Richter, Simon Welker, Timo Gerkmann

TL;DR

This work presents a stochastic regeneration approach in which an estimate given by a predictive model serves as a guide for further diffusion. It shows that this approach enables the use of lighter sampling schemes with fewer diffusion steps without sacrificing quality, thus reducing the computational burden by an order of magnitude.

Abstract

Diffusion models have shown a great ability to bridge the performance gap between predictive and generative approaches for speech enhancement. We have shown that they may even outperform their predictive counterparts for non-additive corruption types or when they are evaluated on mismatched conditions. However, diffusion models suffer from a high computational burden, mainly because they require running a neural network for each reverse diffusion step, whereas predictive approaches require only one pass. As diffusion models are generative approaches, they may also produce vocalizing and breathing artifacts in adverse conditions. In comparison, in such difficult scenarios, predictive models typically do not produce such artifacts but tend to distort the target speech instead, thereby degrading the speech quality. In this work, we present a stochastic regeneration approach where an estimate given by a predictive model is provided as a guide for further diffusion. We show that the proposed approach uses the predictive model to remove the vocalizing and breathing artifacts while producing very high quality samples thanks to the diffusion model, even in adverse conditions. We further show that this approach enables the use of lighter sampling schemes with fewer diffusion steps without sacrificing quality, thus reducing the computational burden by an order of magnitude. Source code and audio examples are available online (https://uhh.de/inf-sp-storm).

Paper Structure (31 sections, 16 equations, 10 figures, 7 tables, 2 algorithms)

This paper contains 31 sections, 16 equations, 10 figures, 7 tables, 2 algorithms.

Figures (10)

  • Figure 1: Visualization of the forward and backward processes in \ref{eq:ouve-sde}. The mean curve \ref{eq:mean} is in solid black and the variance \ref{eq:std} is represented by the greyed area. Several realizations of the diffusion process are represented by thin black lines. The mismatch between $p_\tau$ centered on $\mathbf{x}_\tau$ and $\tilde{p}_\tau$ centered on $\mathbf{y}$ comes from the fact that the mean in \ref{eq:mean} cannot reach $\mathbf{y}$ in finite time. This mismatch causes unavoidable bias in the reverse process, even if the score were perfectly known.
  • Figure 2: Visualization of samples obtained with a predictive approach (NCSN++M, see Section \ref{sec:exp}) and a generative model (SGMSE+M, see Richter2022SGMSE++ and Section \ref{sec:exp}) for two ill-posed problems, namely speech dereverberation (top, from Lemercier2022icassp) and JPEG artifact removal (bottom, from Welker2023). The horizontal and vertical axes of the spectrograms represent time and frequency, respectively.
  • Figure 3: Log-energy spectrograms of clean, noisy, processed and residual utterances for denoising (top) and dereverberation (bottom). The predictor used is NCSN++M.
  • Figure 4: Proposed stochastic regeneration inference process. The predictive network is first used to generate a denoised version $D_\theta(\mathbf{y})$. Diffusion-based generation $G_\phi$ is then performed by adding Gaussian noise $\sigma(T)\mathbf{z}$ to this estimate to obtain the start sample $\mathbf{x}_T$ and solving the reverse diffusion SDE \ref{eq:plug-in-reverse-sde}, yielding a sample from the estimated posterior $\mathbf{x}_0 \sim p(\mathbf{x}|D_\theta(\mathbf{y}))$.
  • Figure 5: Visualization of the inference process for the predictive, generative and proposed StoRM models for a complex posterior distribution (see also Figure 1 in delbracio2023inversion). With the proposed two-stage inference, StoRM uses the predictive mapping to the posterior mean $\mathbb{E}[\mathbf x | \mathbf y]$ as an intermediate step for easier generative inference of a posterior sample $\mathbf x$, which is more likely to lie in high-density regions of the posterior $p(\mathbf x | \mathbf y)$.
  • ...and 5 more figures
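
The two-stage inference described in the Figure 4 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the geometric noise schedule, the constants SIGMA_MIN, SIGMA_MAX and GAMMA, the drift toward the noisy input, the Euler-Maruyama integrator, and the toy_predictor and toy_score stand-ins are all hypothetical choices made so that the example runs without trained networks. Only the overall order of operations (predict, add noise $\sigma(T)\mathbf{z}$, integrate a reverse SDE) is taken from the caption.

```python
import numpy as np

# Hypothetical schedule constants, chosen only for illustration.
SIGMA_MIN, SIGMA_MAX, GAMMA = 0.05, 0.5, 1.5


def sigma(t):
    """Assumed geometric noise level sigma(t); a stand-in schedule."""
    return SIGMA_MIN * (SIGMA_MAX / SIGMA_MIN) ** t


def diffusion_coeff(t):
    """Assumed diffusion coefficient g(t) paired with the schedule above."""
    return sigma(t) * np.sqrt(2.0 * np.log(SIGMA_MAX / SIGMA_MIN))


def stochastic_regeneration(y, predictor, score_fn, T=0.5, n_steps=20, rng=None):
    """Predict first, then regenerate by integrating a reverse SDE from T to 0."""
    rng = np.random.default_rng() if rng is None else rng
    d = predictor(y)                                  # first stage: D_theta(y)
    x = d + sigma(T) * rng.standard_normal(y.shape)   # start sample x_T
    dt = T / n_steps
    for i in range(n_steps):
        t = T - i * dt
        g = diffusion_coeff(t)
        drift = GAMMA * (y - x)                       # assumed forward drift
        z = rng.standard_normal(y.shape)
        # Euler-Maruyama step of the reverse SDE, taken backward in time.
        x = x - (drift - g**2 * score_fn(x, y, d, t)) * dt + g * np.sqrt(dt) * z
    return x


def toy_predictor(y):
    """Stand-in for the predictive network: a fixed attenuation."""
    return 0.9 * y


def toy_score(x, y, d, t):
    """Stand-in for the score network: Gaussian score centered on d."""
    return -(x - d) / sigma(t) ** 2


y = np.random.default_rng(0).standard_normal(16)      # fake noisy signal
x0 = stochastic_regeneration(y, toy_predictor, toy_score,
                             rng=np.random.default_rng(1))
print(x0.shape)
```

In this sketch, n_steps controls how many score evaluations the second stage costs, which is the quantity the abstract says can be reduced when a predictive estimate guides the diffusion.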