
Investigating the Design Space of Diffusion Models for Speech Enhancement

Philippe Gonzalez, Zheng-Hua Tan, Jan Østergaard, Jesper Jensen, Tommy Sonne Alstrøm, Tobias May

TL;DR

The paper addresses diffusion-based speech enhancement. It extends the EDM framework to cover diffusion processes whose long-term mean is non-zero and drifts toward a conditioner (the noisy speech), which makes a systematic design-space exploration possible. By varying the preconditioning, loss weighting, SDE, and reverse-process stochasticity, the authors show that drift toward the conditioner is not essential and that a Heun-based sampler can substantially reduce the number of sampling steps while preserving or improving performance. The resulting system reduces computational cost by a factor of four relative to a popular diffusion-based baseline and outperforms several discriminative baselines on perceptual metrics in matched conditions. These findings support a fully generative view of diffusion models for speech enhancement and suggest practical routes to more efficient inference.

Abstract

Diffusion models are a new class of generative models that have shown outstanding performance in the image generation literature. As a consequence, studies have attempted to apply diffusion models to other tasks, such as speech enhancement. A popular approach to adapting diffusion models to speech enhancement consists in modelling a progressive transformation between the clean and noisy speech signals. However, one popular diffusion model framework previously established in the image generation literature did not account for such a transformation towards the system input, which prevents existing diffusion-based speech enhancement systems from being related to that framework. To address this, we extend this framework to account for the progressive transformation between the clean and noisy speech signals. This allows us to apply recent developments from the image generation literature, and to systematically investigate design aspects of diffusion models that remain largely unexplored for speech enhancement, such as the neural network preconditioning, the training loss weighting, the stochastic differential equation (SDE), and the amount of stochasticity injected in the reverse process. We show that the performance of previous diffusion-based speech enhancement systems cannot be attributed to the progressive transformation between the clean and noisy speech signals. Moreover, we show that a proper choice of preconditioning, training loss weighting, SDE and sampler makes it possible to outperform a popular diffusion-based speech enhancement system while using fewer sampling steps, thus reducing the computational cost by a factor of four.
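
The "progressive transformation between the clean and noisy speech signals" can be pictured with a small sketch. It assumes an Ornstein-Uhlenbeck-type drift of the form gamma * (y - x), which is one common choice in the diffusion-based speech enhancement literature and not necessarily any of the SDEs studied in the paper. The function name `drifting_mean`, the value of `gamma`, and the toy data are all hypothetical.

```python
import numpy as np

def drifting_mean(x0, y, t, gamma=1.5):
    """Mean at time t of dx = gamma * (y - x) dt + g(t) dw, started at x0.

    Solving dm/dt = gamma * (y - m) with m(0) = x0 gives an exponential
    interpolation from the clean state x0 toward the conditioner y.
    """
    decay = np.exp(-gamma * t)
    return decay * x0 + (1.0 - decay) * y

rng = np.random.default_rng(0)
# Toy complex "STFT bins": x0 stands in for clean speech, y for noisy speech.
x0 = rng.standard_normal(4) + 1j * rng.standard_normal(4)
y = x0 + 0.5 * (rng.standard_normal(4) + 1j * rng.standard_normal(4))

for t in (0.0, 0.5, 1.0):
    gap = np.linalg.norm(drifting_mean(x0, y, t) - y)
    print(f"t={t:.1f}  distance of process mean to y: {gap:.3f}")
```

The printed distance shrinks as t grows, which is the drift of the mean toward the noisy speech that the abstract refers to. The noise term g(t) dw is deliberately left out of the sketch.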

Paper Structure (42 sections, 42 equations, 6 figures, 1 table, 2 algorithms)

This paper contains 42 sections, 42 equations, 6 figures, 1 table, 2 algorithms.

Figures (6)

  • Figure 1: Example evolution of the diffusion process for each SDE investigated. In the speech enhancement case (a), the initial state of the diffusion process is a complex STFT representation of the clean speech ${\boldsymbol{x}_0}$. Each SDE induces a drift of the mean of the process towards the noisy speech ${\boldsymbol{y}}$ at a different rate. The magnitude of the complex-valued diffusion process is plotted. In the image generation case (b), the initial state is a clean image and no drift towards the conditioner occurs, which is equivalent to setting ${\boldsymbol{y} = \mathbf{0}}$ in our framework.
  • Figure 2: Illustration of the different SDEs investigated.
  • Figure 3: Speech enhancement performance as a function of the number of sampling steps ${n_{\mathrm{steps}}}$ when changing the preconditioning parameters incrementally from the values in Eq. \ref{eq:sgmse_precond} used in Welker et al. (2022) and Richter et al. (2023) to the values in Eq. \ref{eq:preconditioning} suggested in Karras et al. (2022). $c_{\mathrm{shift}}$ is changed from $\boldsymbol{y}$ to $\mathbf{0}$.
  • Figure 4: Speech enhancement performance as a function of the number of sampling steps ${n_{\mathrm{steps}}}$ for different SDEs.
  • Figure 5: Speech enhancement performance as a function of the amount of stochasticity injected in the reverse process for different numbers of sampling steps ${n_{\mathrm{steps}}}$. For the predictor-corrector (PC) sampler (a), the stochasticity is controlled by the step size ${r}$ of the annealed Langevin dynamics correction. For the Heun-based sampler (b), it is controlled by the parameter ${S_{\mathrm{churn}}}$.
  • ...and 1 more figure
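
The role of ${S_{\mathrm{churn}}}$ in the Heun-based sampler of Figure 5 can be illustrated with a minimal sketch of the stochastic second-order sampler from Karras et al. (2022). Several assumptions apply: the noise schedule is $\sigma(t) = t$ with no scaling, the zero-drift case ${\boldsymbol{y} = \mathbf{0}}$ is used, real-valued arrays replace complex STFT coefficients, churn is applied at every noise level (no $S_{\mathrm{tmin}}$/$S_{\mathrm{tmax}}$ window), and `toy_denoiser` is a hypothetical stand-in for the trained network. The paper's own sampler may differ in these details.

```python
import numpy as np

def karras_sigmas(n_steps, sigma_min=0.002, sigma_max=80.0, rho=7.0):
    """Noise levels from sigma_max down to sigma_min, followed by a final 0."""
    ramp = np.linspace(0.0, 1.0, n_steps)
    inv_rho = sigma_max ** (1 / rho) + ramp * (
        sigma_min ** (1 / rho) - sigma_max ** (1 / rho)
    )
    return np.append(inv_rho ** rho, 0.0)

def heun_sampler(denoise, x, sigmas, s_churn=0.0, s_noise=1.0, rng=None):
    """Stochastic Heun sampler; s_churn=0 gives the deterministic ODE solver."""
    if rng is None:
        rng = np.random.default_rng()
    n = len(sigmas) - 1
    gamma = min(s_churn / n, np.sqrt(2.0) - 1.0)
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        # Churn: temporarily raise the noise level and add matching noise.
        sigma_hat = sigma * (1.0 + gamma)
        eps = s_noise * rng.standard_normal(x.shape)
        x_hat = x + np.sqrt(sigma_hat**2 - sigma**2) * eps
        # Euler step from sigma_hat to sigma_next.
        d = (x_hat - denoise(x_hat, sigma_hat)) / sigma_hat
        x = x_hat + (sigma_next - sigma_hat) * d
        # Second-order (Heun) correction, skipped on the final step to 0.
        if sigma_next > 0:
            d_next = (x - denoise(x, sigma_next)) / sigma_next
            x = x_hat + (sigma_next - sigma_hat) * 0.5 * (d + d_next)
    return x

# Toy setup: the "data distribution" is a single point, so the ideal
# denoiser always returns that point. `calls` records network evaluations.
target = np.array([0.3, -1.2, 0.7])
calls = []

def toy_denoiser(x, sigma):
    calls.append(sigma)
    return target

sigmas = karras_sigmas(n_steps=8)
rng = np.random.default_rng(0)
x_init = sigmas[0] * rng.standard_normal(target.shape)
x_out = heun_sampler(toy_denoiser, x_init, sigmas, s_churn=4.0, rng=rng)
print("recovered target:", np.allclose(x_out, target))
print("denoiser evaluations:", len(calls))
```

With N sampling steps this sampler evaluates the denoiser 2N - 1 times, since the correction is skipped on the last step. This is why the number of sampling steps, and not only the sampler order, determines the inference cost.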