FlowAVSE: Efficient Audio-Visual Speech Enhancement with Conditional Flow Matching
Chaeyoung Jung, Suyeon Lee, Ji-Hoon Kim, Joon Son Chung
TL;DR
FlowAVSE addresses the slow inference of diffusion-based audio-visual speech enhancement with a conditional flow matching framework. It uses a two-stage architecture in which a visually informed predictor is followed by a refiner based on a continuous normalizing flow (CNF); the refiner uses a lightweight U-net with cross-attention to align the audio and visual modalities, which enables one-step sampling. Experiments on VoxCeleb2 and LRS3 show roughly 22x faster inference and about 50% fewer parameters, with competitive or superior SI-SDR, PESQ, and ESTOI. This makes the approach suitable for real-time audio-visual denoising that is robust to in-the-wild noise.
Abstract
This work proposes an efficient method to enhance the quality of corrupted speech signals by leveraging both acoustic and visual cues. While existing diffusion-based approaches have demonstrated remarkable quality, their applicability is limited by slow inference and high computational complexity. To address this issue, we present FlowAVSE, which improves inference speed and reduces the number of learnable parameters without degrading output quality. In particular, we employ a conditional flow matching algorithm that enables the generation of high-quality speech in a single sampling step. Moreover, we increase efficiency by optimizing the underlying U-net architecture of diffusion-based systems. Our experiments demonstrate that FlowAVSE achieves 22 times faster inference and halves the model size while maintaining output quality. The demo page is available at: https://cyongong.github.io/FlowAVSE.github.io/
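To illustrate why conditional flow matching can permit sampling in a single step, the following is a minimal NumPy sketch, not the FlowAVSE implementation. It assumes a straight-line probability path between a starting point and the clean target, which is a common choice in flow matching but is not confirmed by the text above. The function names, the feature dimension, and the oracle velocity field are hypothetical and exist only for illustration; in practice the velocity would come from a trained network.

```python
import numpy as np

def interpolate(x0, x1, t):
    """Point on the assumed straight-line path from x0 (t=0) to x1 (t=1)."""
    return (1.0 - t) * x0 + t * x1

def target_velocity(x0, x1):
    """Time derivative of the straight-line path; it does not depend on t."""
    return x1 - x0

def cfm_loss(velocity_fn, x0, x1, t):
    """Mean squared error between predicted and target velocity at x_t."""
    x_t = interpolate(x0, x1, t)
    return float(np.mean((velocity_fn(x_t, t) - target_velocity(x0, x1)) ** 2))

def one_step_sample(velocity_fn, x0):
    """One Euler step across the whole interval [0, 1], starting at t=0."""
    return x0 + velocity_fn(x0, 0.0)

rng = np.random.default_rng(0)
clean = rng.standard_normal(16)                 # stand-in for clean speech features
start = clean + 0.5 * rng.standard_normal(16)   # stand-in for the starting estimate

def oracle(x, t):
    """Hypothetical ideal velocity field for this single (start, clean) pair."""
    return clean - start

loss = cfm_loss(oracle, start, clean, t=0.3)
enhanced = one_step_sample(oracle, start)
```

Because the assumed path is a straight line, its velocity is constant in time, so a single Euler step with an accurate velocity estimate lands on the target. A learned velocity field is only approximate, so single-step quality depends on how well the network fits this target.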
