Leveraging Mamba with Full-Face Vision for Audio-Visual Speech Enhancement
Rong Chao, Wenze Ren, You-Jin Li, Kuo-Hsuan Hung, Sung-Feng Huang, Szu-Wei Fu, Wen-Huang Cheng, Yu Tsao
TL;DR
AVSEMamba addresses the cocktail party problem by integrating full-face visual cues with a Mamba state-space time-frequency backbone for audio-visual speech enhancement in multi-speaker settings. The architecture fuses STFT-derived audio features with spatiotemporal embeddings from a pretrained, frozen 3D ResNet-18 applied to full-face video, then processes the combined representation through Mamba blocks to jointly model time and frequency dependencies. Evaluated on AVSEC-4, AVSEMamba achieves first place on the monaural leaderboard, with gains in STOI, PESQ, and UTMOS over the baselines, demonstrating the effectiveness and efficiency of state-space modeling for AVSE. The results suggest strong practical potential for robust, visually guided speech extraction in challenging acoustic environments and under visual occlusion.
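The fusion step described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the STFT settings (`n_fft=400`, `hop=100`), the 16 kHz sample rate, the 25 fps video rate, the 512-dimensional visual embedding, the nearest-neighbour alignment, and fusion by simple concatenation are all assumptions chosen for illustration, and the function names are hypothetical. Random arrays stand in for real audio and for the frozen 3D ResNet-18 output.

```python
import numpy as np

def stft_magnitude(wave, n_fft=400, hop=100):
    """Magnitude STFT with a Hann window; returns (frames, n_fft // 2 + 1)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wave) - n_fft) // hop
    frames = np.stack([wave[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))

def align_visual(visual, n_audio_frames):
    """Repeat per-video-frame embeddings so there is one per audio frame
    (nearest-neighbour upsampling; an assumed alignment scheme)."""
    idx = np.floor(np.arange(n_audio_frames) * len(visual) / n_audio_frames)
    return visual[idx.astype(int)]

def fuse(audio_feats, visual_feats):
    """Concatenate time-aligned audio and visual features along the feature axis."""
    return np.concatenate([audio_feats, visual_feats], axis=1)

rng = np.random.default_rng(0)
wave = rng.standard_normal(16000)        # placeholder: 1 s of audio at 16 kHz
visual = rng.standard_normal((25, 512))  # placeholder: 25 video-frame embeddings

audio_feats = stft_magnitude(wave)                  # (157, 201)
aligned = align_visual(visual, len(audio_feats))    # (157, 512)
fused = fuse(audio_feats, aligned)                  # (157, 713)
```

In a full model, `fused` would be the input to the Mamba blocks; a learned projection would normally replace the raw concatenation used here.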
Abstract
Recent Mamba-based models have shown promise in speech enhancement by efficiently modeling long-range temporal dependencies. However, models like Speech Enhancement Mamba (SEMamba) remain limited to single-speaker scenarios and struggle in complex multi-speaker environments such as the cocktail party problem. To overcome this, we introduce AVSEMamba, an audio-visual speech enhancement model that integrates full-face visual cues with a Mamba-based temporal backbone. By leveraging spatiotemporal visual information, AVSEMamba enables more accurate extraction of target speech in challenging conditions. Evaluated on the AVSEC-4 Challenge development and blind test sets, AVSEMamba outperforms other monaural baselines in speech intelligibility (STOI), perceptual quality (PESQ), and non-intrusive quality (UTMOS), and achieves 1st place on the monaural leaderboard.
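To give a sense of why state-space models handle long sequences efficiently, the sketch below runs a plain linear state-space recurrence, h_t = a * h_{t-1} + b * x_t with output y_t = c . h_t, in a single pass whose cost grows linearly with sequence length. This is a simplified illustration, not Mamba itself: Mamba additionally makes its parameters input-dependent and uses a hardware-aware parallel scan, both omitted here. The function name `ssm_scan` and the parameter values are hypothetical.

```python
import numpy as np

def ssm_scan(x, a, b, c):
    """Linear state-space recurrence over a 1-D input sequence.

    a, b, c are length-N vectors (a diagonal state matrix is assumed):
        h_t = a * h_{t-1} + b * x_t
        y_t = c . h_t
    """
    h = np.zeros_like(a)
    y = np.empty(len(x))
    for t, x_t in enumerate(x):
        h = a * h + b * x_t   # state update carries information forward in time
        y[t] = c @ h          # read-out from the hidden state
    return y

a = np.array([0.9, 0.5])   # per-state decay: slow and fast memory
b = np.array([1.0, 1.0])
c = np.array([1.0, 1.0])

impulse = np.zeros(5)
impulse[0] = 1.0
y = ssm_scan(impulse, a, b, c)   # impulse response: 0.9**t + 0.5**t
```

The slowly decaying state (0.9) keeps contributing to the output many steps after the input, which is the mechanism behind long-range temporal modeling in this family of models.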
