Single and Few-step Diffusion for Generative Speech Enhancement

Bunlong Lay, Jean-Marie Lemercier, Julius Richter, Timo Gerkmann

TL;DR

This work tackles slow inference in diffusion-based speech enhancement (SE) with a two-stage training pipeline: the model is first trained with denoising score matching (DSM) and then fine-tuned with a predictive loss computed on the output of the reverse process. The second stage corrects discretization and prior-mismatch errors, enabling high-quality SE with as few as 5 function evaluations, closely matching or surpassing the baseline run with 60 function evaluations. The approach is robust to a reduced number of function evaluations (NFEs) and generalizes better to unseen data than predictive baselines and StoRM variants. The method builds on the Brownian Bridge with Exponential Diffusion Coefficient (BBED) process and combines DSM training with the second-stage fine-tuning, referred to as CRP, to deliver efficient SE in fast-inference regimes.

Abstract

Diffusion models have shown promising results in speech enhancement, using a task-adapted diffusion process for the conditional generation of clean speech given a noisy mixture. However, at test time, the neural network used for score estimation is called multiple times to solve the iterative reverse process. This results in a slow inference process and causes discretization errors that accumulate over the sampling trajectory. In this paper, we address these limitations through a two-stage training approach. In the first stage, we train the diffusion model the usual way using the generative denoising score matching loss. In the second stage, we compute the enhanced signal by solving the reverse process and compare the resulting estimate to the clean speech target using a predictive loss. We show that using this second training stage enables achieving the same performance as the baseline model using only 5 function evaluations instead of 60 function evaluations. While the performance of usual generative diffusion algorithms drops dramatically when lowering the number of function evaluations (NFEs) to obtain single-step diffusion, we show that our proposed method keeps a steady performance and therefore largely outperforms the diffusion baseline in this setting and also generalizes better than its predictive counterpart.
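
To make the abstract's central idea concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes a scalar ODE dx/dt = -theta * x as a stand-in for the reverse process, with one Euler step counted as one function evaluation. The names `euler_solve` and `fine_tune` are hypothetical. The sketch first shows that the error of a correctly parameterized model grows as the step budget shrinks, and then shows that minimizing a squared error on the solver output (an analog of the predictive second stage) removes that error for a fixed small step budget.

```python
import math

def euler_solve(theta, x0, n_steps):
    """Integrate dx/dt = -theta * x from t=0 to t=1 with n_steps Euler steps."""
    h = 1.0 / n_steps
    x = x0
    for _ in range(n_steps):
        x = x + h * (-theta * x)  # one "function evaluation" per step
    return x

def fine_tune(theta, x0, target, n_steps, lr=0.5, iters=200):
    """Toy second stage: gradient descent on the squared error of the solver output."""
    h = 1.0 / n_steps
    for _ in range(iters):
        out = euler_solve(theta, x0, n_steps)
        # Analytic derivative of out = x0 * (1 - theta*h)**n_steps w.r.t. theta.
        d_out = -x0 * (1.0 - theta * h) ** (n_steps - 1)
        theta -= lr * 2.0 * (out - target) * d_out
    return theta

x0 = 1.0
target = x0 * math.exp(-1.0)  # exact solution at t = 1 for theta = 1

# Stand-in for stage 1: theta = 1 is the exact drift parameter (assumption of this toy).
errors = {n: abs(euler_solve(1.0, x0, n) - target) for n in (1, 5, 60)}

# Stand-in for stage 2: adapt theta to the solver under a fixed small step budget.
theta_1 = fine_tune(1.0, x0, target, n_steps=1)
theta_5 = fine_tune(1.0, x0, target, n_steps=5)
corrected = {
    1: abs(euler_solve(theta_1, x0, 1) - target),
    5: abs(euler_solve(theta_5, x0, 5) - target),
}

print("error before fine-tuning:", errors)
print("error after fine-tuning: ", corrected)
```

In this toy setting the fine-tuned parameter differs from the exact one: it absorbs the solver's discretization error for the chosen step count. Whether the same mechanism accounts for the paper's results is not something this sketch can establish.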

Paper Structure (18 sections, 11 equations, 1 figure, 1 table)

Figures (1)

  • Figure 1: Performance of the proposed CRP, StoRM-BBED, the predictive baseline, and the generative baseline as a function of NFEs, when trained and tested on WSJ0-C3. In this work, the NFE count is the number of NCSN++ evaluations. The number of discretization steps $n_{\text{steps}}$ is chosen as described in Section \ref{sec:exp:crp}. For CRP and the generative baseline, $n_{\text{steps}} = \text{NFE}$. For StoRM-BBED, $n_{\text{steps}} + 1 = \text{NFE}$ (see Section \ref{sec:exp:baseline}). All solid lines use the discretization schedule described in Section \ref{sec:exp:crp}. The dotted blue line uses the discretization schedule from Section \ref{sec:bbed-para}.
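
The NFE accounting in the caption can be written out as a small helper, which may be useful when comparing methods at an equal evaluation budget. This is only a sketch of the two relations stated in the caption. The function name `nfe_from_steps` and the method labels are hypothetical, and the reason for the extra evaluation in StoRM-BBED is not given here (the caption defers to the referenced section).

```python
def nfe_from_steps(method: str, n_steps: int) -> int:
    """Return the number of NCSN++ evaluations for a given number of solver steps."""
    if n_steps < 1:
        raise ValueError("n_steps must be at least 1")
    if method in ("crp", "generative"):
        return n_steps          # caption: n_steps = NFE
    if method == "storm-bbed":
        return n_steps + 1      # caption: n_steps + 1 = NFE
    raise ValueError(f"unknown method: {method}")

# Example: the same number of discretization steps gives different NFE counts.
budget = {m: nfe_from_steps(m, 4) for m in ("crp", "generative", "storm-bbed")}
print(budget)
```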