Rethinking Flow and Diffusion Bridge Models for Speech Enhancement

Dahan Wang, Jun Gao, Tong Lei, Yuxiang Hu, Changbao Zhu, Kai Chen, Jing Lu

TL;DR

This work unifies flow matching, diffusion bridge, and Schrödinger bridge approaches to speech enhancement (SE) by representing them as Gaussian probability paths between paired noisy and clean speech. It shows that each sampling step of a model trained with a data-prediction objective effectively performs predictive SE, and it introduces an enhanced bridge model that combines a TF-GridNet backbone with time embeddings, a refined loss, and a predictive fine-tuning strategy. On denoising and dereverberation benchmarks, the proposed model outperforms existing flow and diffusion baselines with fewer parameters and lower computational complexity. The results also indicate that the predictive nature of these generative frameworks limits their achievable upper-bound performance. The work thus offers a practical blueprint for using predictive strategies within generative SE models and clarifies a fundamental limit of this paradigm.

Abstract

Flow matching and diffusion bridge models have emerged as leading paradigms in generative speech enhancement, modeling stochastic processes between paired noisy and clean speech signals based on principles such as flow matching, score matching, and Schrödinger bridge. In this paper, we present a framework that unifies existing flow and diffusion bridge models by interpreting them as constructions of Gaussian probability paths with varying means and variances between paired data. Furthermore, we investigate the underlying consistency between the training/inference procedures of these generative models and conventional predictive models. Our analysis reveals that each sampling step of a well-trained flow or diffusion bridge model optimized with a data prediction loss is theoretically analogous to executing predictive speech enhancement. Motivated by this insight, we introduce an enhanced bridge model that integrates an effective probability path design with key elements from predictive paradigms, including improved network architecture, tailored loss functions, and optimized training strategies. Experiments on denoising and dereverberation tasks demonstrate that the proposed method outperforms existing flow and diffusion baselines with fewer parameters and reduced computational complexity. The results also highlight that the inherently predictive nature of this generative framework imposes limitations on its achievable upper-bound performance.
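
As a rough illustration of what a Gaussian probability path between paired data looks like, the Python sketch below draws an intermediate state x_t from N(mu_t, sigma_t^2 I) and evaluates a data prediction loss. The linear mean, the Brownian-bridge-style variance, the `sigma_max` parameter, and both function names are assumptions chosen for illustration; the paper's own path designs and loss functions are not reproduced here.

```python
import numpy as np


def sample_gaussian_path(x_clean, y_noisy, t, sigma_max=0.5, rng=None):
    """Draw x_t ~ N(mu_t, sigma_t^2 I) for one (clean, noisy) pair.

    Assumed schedule (illustrative only, not taken from the paper):
      mu_t    = (1 - t) * x_clean + t * y_noisy   # linear mean interpolation
      sigma_t = sigma_max * sqrt(t * (1 - t))     # vanishes at both endpoints
    """
    rng = np.random.default_rng() if rng is None else rng
    mu_t = (1.0 - t) * x_clean + t * y_noisy
    sigma_t = sigma_max * np.sqrt(t * (1.0 - t))
    x_t = mu_t + sigma_t * rng.standard_normal(x_clean.shape)
    return x_t, mu_t, sigma_t


def data_prediction_loss(x_pred, x_clean):
    """Mean squared error between a predicted and a reference clean signal."""
    return float(np.mean((x_pred - x_clean) ** 2))


# Toy signals standing in for clean and noisy speech features.
rng = np.random.default_rng(0)
x_clean = rng.standard_normal(16)
y_noisy = x_clean + 0.3 * rng.standard_normal(16)

# An intermediate state halfway along the path.
x_half, mu_half, sigma_half = sample_gaussian_path(x_clean, y_noisy, 0.5, rng=rng)
```

Under this assumed schedule the path starts at the clean signal (t = 0) and ends at the noisy one (t = 1). A network trained with a data prediction loss would take x_t, the noisy input, and t, and be scored against `x_clean`, which is the setting in which the abstract relates each sampling step to predictive speech enhancement.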

Paper Structure

This paper contains 40 sections, 73 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Illustration of the backbone network's working mechanism during training and ODE-based sampling (as expressed in Eq. (20)) under the data prediction strategy.
  • Figure 2: Weight distribution of the network outputs from each step in the ODE-based sampling result (SB-CFM parameterization, $N = 10$). The arrows indicate that sampling proceeds in the reverse time direction.
  • Figure 3: Schematic illustration of the time-embedding-assisted TF-GridNet.
  • Figure 4: Average PESQ and UTMOS of network outputs at each step during sampling ($N=5$) for the proposed bridge model (without CRP). Dots represent the scores of intermediate network outputs; lines indicate the metrics of the predictive TF-GridNet output and the final sampling results of the proposed bridge model with and without CRP fine-tuning. The arrows indicate that sampling is performed in the reverse time direction.