Leveraging Mamba with Full-Face Vision for Audio-Visual Speech Enhancement

Rong Chao, Wenze Ren, You-Jin Li, Kuo-Hsuan Hung, Sung-Feng Huang, Szu-Wei Fu, Wen-Huang Cheng, Yu Tsao

TL;DR

AVSEMamba addresses the cocktail party problem by integrating full-face visual cues with a Mamba state-space time-frequency backbone for audio-visual speech enhancement (AVSE) in multi-speaker settings. The architecture fuses STFT-based audio features with spatiotemporal embeddings extracted from full-face video by a pretrained, frozen 3D ResNet-18, then processes the combined representation through Mamba blocks that jointly model time and frequency dependencies. Evaluated on AVSEC-4, AVSEMamba takes first place on the monaural leaderboard, with substantial gains over the baselines in MBSTOI, PESQ, and UTMOS, demonstrating the effectiveness and efficiency of state-space modeling for AVSE. The results suggest strong practical potential for robust, visually guided speech extraction in challenging acoustic environments and under visual occlusion.
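
The fusion step described above can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation: a random array stands in for the frozen 3D ResNet-18 embeddings, and the 16 kHz sample rate, 25 fps frame rate, STFT settings, 64-dimensional embedding size, linear interpolation, and concatenation-based fusion are all assumptions chosen for illustration. The function names are hypothetical.

```python
import numpy as np

def stft_magnitude(wave, n_fft=512, hop=256):
    """Magnitude STFT with a Hann window; returns (frames, n_fft // 2 + 1)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wave) - n_fft) // hop
    frames = np.stack([wave[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=-1))

def align_visual(visual, n_audio_frames):
    """Linearly interpolate (video_frames, dim) embeddings to the audio frame count."""
    src = np.linspace(0.0, 1.0, visual.shape[0])
    dst = np.linspace(0.0, 1.0, n_audio_frames)
    return np.stack([np.interp(dst, src, visual[:, d])
                     for d in range(visual.shape[1])], axis=1)

def fuse(audio_feats, visual_feats):
    """Concatenate per-frame audio and visual features along the feature axis."""
    return np.concatenate([audio_feats, visual_feats], axis=1)

rng = np.random.default_rng(0)
wave = rng.standard_normal(16000)        # 1 s of audio at an assumed 16 kHz
visual = rng.standard_normal((25, 64))   # placeholder for 25 frames of visual embeddings

audio_feats = stft_magnitude(wave)                          # (audio_frames, freq_bins)
visual_feats = align_visual(visual, audio_feats.shape[0])   # (audio_frames, 64)
fused = fuse(audio_feats, visual_feats)                     # input to the sequence backbone
print(fused.shape)
```

The point of the alignment step is that video frames arrive far less often than STFT frames, so the visual stream has to be resampled before the two can be combined frame by frame. How AVSEMamba actually performs that alignment and fusion is not specified in this summary.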

Abstract

Recent Mamba-based models have shown promise in speech enhancement by efficiently modeling long-range temporal dependencies. However, models like Speech Enhancement Mamba (SEMamba) remain limited to single-speaker scenarios and struggle in complex multi-speaker environments such as the cocktail party problem. To overcome this, we introduce AVSEMamba, an audio-visual speech enhancement model that integrates full-face visual cues with a Mamba-based temporal backbone. By leveraging spatiotemporal visual information, AVSEMamba enables more accurate extraction of target speech in challenging conditions. Evaluated on the AVSEC-4 Challenge development and blind test sets, AVSEMamba outperforms other monaural baselines in speech intelligibility (STOI), perceptual quality (PESQ), and non-intrusive quality (UTMOS), and achieves 1st place on the monaural leaderboard.
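
To give a feel for how a state-space model carries information across time steps, here is a minimal sketch of a linear state-space recurrence with a diagonal state matrix. It is an illustrative simplification under stated assumptions, not the Mamba layer used in the paper: the parameters `a`, `b`, and `c` are fixed, hypothetical values, whereas Mamba makes its parameters depend on the input (a selective scan).

```python
import numpy as np

def ssm_scan(x, a, b, c):
    """Run h_t = a * h_{t-1} + b * x_t and y_t = c . h_t over a 1-D sequence.

    a, b, c are length-N vectors (diagonal state matrix), so each step costs O(N)
    and the whole sequence costs O(T * N), linear in the sequence length T.
    """
    h = np.zeros_like(a)
    y = np.empty(len(x))
    for t, x_t in enumerate(x):
        h = a * h + b * x_t   # state update: past inputs decay at rates set by a
        y[t] = c @ h          # readout: mix the state channels into one output
    return y

# Hypothetical two-channel state: one slow-decaying channel, one fast-decaying.
a = np.array([0.9, 0.5])
b = np.array([1.0, 1.0])
c = np.array([1.0, 1.0])

impulse = np.zeros(5)
impulse[0] = 1.0
y = ssm_scan(impulse, a, b, c)
print(y)
```

For a unit impulse, the output at step t is 0.9^t + 0.5^t, so the channel with `a` close to 1 keeps the input's influence alive for many steps. That slow decay is the mechanism that lets a recurrence of this kind track long-range dependencies at linear cost.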

Paper Structure

This paper contains 10 sections, 1 figure, 2 tables.

Figures (1)

  • Figure 1: System architecture of the proposed AVSEMamba model.