FlowAVSE: Efficient Audio-Visual Speech Enhancement with Conditional Flow Matching

Chaeyoung Jung, Suyeon Lee, Ji-Hoon Kim, Joon Son Chung

TL;DR

FlowAVSE addresses the slow inference of diffusion-based audio-visual speech enhancement with a conditional flow matching framework. It uses a two-stage architecture in which a visually informed predictor is followed by a refiner based on a continuous normalizing flow (CNF). A lightweight U-net with cross-attention aligns the audio and visual modalities, and the flow matching formulation enables one-step sampling. Experiments on VoxCeleb2 and LRS3 show roughly 22x faster inference and about 50% fewer parameters, with competitive or superior SI-SDR, PESQ, and ESTOI. This makes real-time audio-visual denoising feasible, with robustness to in-the-wild noise.

Abstract

This work proposes an efficient method to enhance the quality of corrupted speech signals by leveraging both acoustic and visual cues. While existing diffusion-based approaches have demonstrated remarkable quality, their applicability is limited by slow inference speeds and computational complexity. To address this issue, we present FlowAVSE, which improves inference speed and reduces the number of learnable parameters without degrading the output quality. In particular, we employ a conditional flow matching algorithm that enables the generation of high-quality speech in a single sampling step. Moreover, we increase efficiency by optimizing the underlying U-net architecture of diffusion-based systems. Our experiments demonstrate that FlowAVSE achieves 22 times faster inference and halves the model size while maintaining the output quality. The demo page is available at: https://cyongong.github.io/FlowAVSE.github.io/
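
To illustrate why a flow matching model can sample in a single step, the following is a minimal sketch and not the authors' implementation. It assumes a straight-line probability path, for which the conditional velocity is constant along the path, so one Euler step from t=0 to t=1 already lands on the target. The names `euler_sample` and `make_straight_path_velocity` are hypothetical, and the toy velocity function stands in for a trained network; the paper's actual path, conditioning, and starting point are not specified in this summary.

```python
from typing import Callable, List

Vector = List[float]


def euler_sample(
    velocity_fn: Callable[[Vector, float], Vector],
    x_start: Vector,
    num_steps: int = 1,
) -> Vector:
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-step Euler."""
    if num_steps < 1:
        raise ValueError("num_steps must be at least 1")
    x = list(x_start)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # evaluated at the start of each interval, so t < 1
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def make_straight_path_velocity(target: Vector) -> Callable[[Vector, float], Vector]:
    """Toy stand-in for a learned velocity network (assumption).

    For the straight path x_t = (1 - t) * x_0 + t * x_1, the conditional
    velocity at (x_t, t) is (x_1 - x_t) / (1 - t), which equals x_1 - x_0.
    """
    def velocity(x: Vector, t: float) -> Vector:
        return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]
    return velocity


if __name__ == "__main__":
    target = [0.5, -1.0, 2.0]    # illustrative "clean" values
    x_start = [0.0, 0.3, -0.7]   # illustrative starting point
    v = make_straight_path_velocity(target)
    print(euler_sample(v, x_start, num_steps=1))
    print(euler_sample(v, x_start, num_steps=4))
```

With an exactly straight path, one step and four steps give the same result up to floating-point error. A learned velocity field is only approximately straight, so in practice the gap between single-step and multi-step sampling has to be checked empirically.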

Paper Structure (11 sections, 10 equations, 5 figures, 3 tables)

Figures (5)

  • Figure 1: Analysis of inference speed, parameter size, and SI-SDR scores on the VoxCeleb2 test set. The real-time factor measures the processing time needed to generate 1 second of audio. Our small model is approximately 22 times faster at inference and 2 times lighter than the previous model while achieving superior performance.
  • Figure 2: Model architecture of FlowAVSE. Face-cropped images of the speaker are fed to the visual encoder to obtain the visual embedding $\mathbf{f_v}$. In $P_\theta$ and $G_\phi$, the visual embedding $\mathbf{f_v}$ is fused with auditory information from the noisy speech $\mathbf{y}$ to obtain the enhanced speech $\hat{\mathbf{x}}$. Both stages use a U-net architecture and are trained simultaneously with $\mathcal{L}_{total}$.
  • Figure 3: A simplified illustration of the U-net architecture in our model. We remove duplicate convolution modules for enhanced efficiency in our small- and medium-size models.
  • Figure 4: Comparison of AVDiffuSS and our model on the LRS3 test set across various numbers of sampling steps. Our model maintains robust performance even with single-step inference.
  • Figure 5: Comparison between the audio spectrograms of the diffusion-based model and ours. It shows that our model excels in removing not only the synthesized noise but also the background noise recorded along with the target speech.
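
The real-time factor mentioned in the Figure 1 caption can be computed as processing time divided by audio duration. The sketch below is an illustration under that assumption and does not reproduce the paper's measurement protocol; `real_time_factor`, `measure_rtf`, and the `enhance_fn` callable are hypothetical names, and the numbers are made up for the example.

```python
import time
from typing import Callable, List, Sequence


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Return seconds of compute spent per second of audio (lower is faster)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def measure_rtf(
    enhance_fn: Callable[[Sequence[float]], List[float]],
    waveform: Sequence[float],
    sample_rate: int,
) -> float:
    """Time one call to a hypothetical enhancement function and return its RTF."""
    audio_seconds = len(waveform) / sample_rate
    start = time.perf_counter()
    enhance_fn(waveform)
    elapsed = time.perf_counter() - start
    return real_time_factor(elapsed, audio_seconds)


if __name__ == "__main__":
    # Illustrative numbers only: 0.5 s of compute for 10 s of audio.
    print(real_time_factor(0.5, 10.0))

    # Placeholder "model" that copies its input, on 1 s of silence at 16 kHz.
    print(measure_rtf(lambda w: list(w), [0.0] * 16000, 16000))
```

A real-time factor below 1 means the system processes audio faster than it plays back. The ratio of two models' real-time factors on the same hardware and input gives their relative speedup.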