Plug-and-Play Co-Occurring Face Attention for Robust Audio-Visual Speaker Extraction
Zexu Pan, Shengkui Zhao, Tingting Wang, Kun Zhou, Yukun Ma, Chong Zhang, Bin Ma
TL;DR
The paper addresses audio-visual speaker extraction (AVSE) in scenes with multiple co-occurring faces by introducing a plug-and-play Inter-Speaker Attention Module (ISAM) that processes a flexible number of on-screen faces. ISAM applies self-attention along the speaker axis and is integrated into two AVSE backbones, AV-DPRNN and AV-TFGridNet; the models are trained with an SI-SNR objective, using dropout so that they can handle missing faces. Results on VoxCeleb2, MISP, LRS2, and LRS3 show consistent improvements in SI-SNRi and related metrics on both highly overlapped and sparsely overlapped mixtures, with larger gains when more co-occurring faces are observed, and the gains carry over in cross-dataset evaluation. ISAM adds only about 0.2M parameters, which makes it practical to plug into existing AVSE models for real-world multi-person scenes.
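For readers unfamiliar with the metric, the following is a minimal sketch of the standard scale-invariant signal-to-noise ratio (SI-SNR) and its improvement over the unprocessed mixture (SI-SNRi). It is not taken from the paper's code; the function names `si_snr` and `si_snr_improvement`, the `eps` stabilizer, and the use of plain Python lists are assumptions made for illustration.

```python
import math


def si_snr(estimate, reference, eps=1e-8):
    """Scale-invariant SNR in dB between two equal-length 1-D signals.

    Illustrative helper (hypothetical name), following the standard definition.
    """
    if len(estimate) != len(reference):
        raise ValueError("signals must have the same length")
    n = len(reference)
    # Remove the mean so a constant offset does not affect the score.
    est_mean = sum(estimate) / n
    ref_mean = sum(reference) / n
    est = [x - est_mean for x in estimate]
    ref = [x - ref_mean for x in reference]
    # Project the estimate onto the reference to obtain the target component.
    dot = sum(e * r for e, r in zip(est, ref))
    ref_energy = sum(r * r for r in ref) + eps
    scale = dot / ref_energy
    target = [scale * r for r in ref]
    # Whatever is left after the projection is treated as noise.
    noise = [e - t for e, t in zip(est, target)]
    target_energy = sum(t * t for t in target)
    noise_energy = sum(x * x for x in noise)
    return 10.0 * math.log10((target_energy + eps) / (noise_energy + eps))


def si_snr_improvement(estimate, mixture, reference):
    """SI-SNRi: gain of the extracted signal over the raw mixture, in dB."""
    return si_snr(estimate, reference) - si_snr(mixture, reference)
```

When used as a training objective, the usual choice is to minimize the negative SI-SNR of the extracted speech; the summary does not state the exact loss formulation, so treat that as an assumption.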
Abstract
Audio-visual speaker extraction isolates a target speaker's speech from a speech mixture, conditioned on a visual cue, typically a recording of the target speaker's face. In real-world scenarios, however, other co-occurring faces are often present on-screen, and they provide valuable cues about speaker activity in the scene. In this work, we introduce a plug-and-play inter-speaker attention module that processes a flexible number of co-occurring faces, allowing for more accurate speaker extraction in complex multi-person environments. We integrate our module into two prominent models: AV-DPRNN and the state-of-the-art AV-TFGridNet. Extensive experiments on diverse datasets, including the highly overlapped VoxCeleb2 and the sparsely overlapped MISP, demonstrate that our approach consistently outperforms the baselines. Furthermore, cross-dataset evaluations on LRS2 and LRS3 confirm the robustness and generalizability of our method.
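To make the idea of attending across a flexible number of faces concrete, here is a dependency-free sketch of single-head self-attention along the speaker axis, applied independently at each time frame. It is only an illustration under stated assumptions: the function name `inter_speaker_attention`, the `[speaker][frame][dim]` layout, the `present` mask, and the absence of learned projections are all hypothetical, and the actual module design is not specified in this summary.

```python
import math


def inter_speaker_attention(features, present):
    """Self-attention over speakers at each frame (illustrative sketch).

    features[s][t] is a length-D embedding of speaker s at frame t.
    present[s] is False for a face that is missing (or dropped in training).
    Present speakers attend over all present speakers at the same frame;
    absent speakers are passed through unchanged.
    """
    num_speakers = len(features)
    num_frames = len(features[0])
    dim = len(features[0][0])
    scale = 1.0 / math.sqrt(dim)
    # Only present speakers act as queries and keys, so the speaker count can vary.
    active = [s for s in range(num_speakers) if present[s]]
    output = [[list(features[s][t]) for t in range(num_frames)]
              for s in range(num_speakers)]
    for t in range(num_frames):
        for q in active:
            # Scaled dot-product scores between speaker q and every present speaker.
            scores = [scale * sum(a * b for a, b in zip(features[q][t], features[k][t]))
                      for k in active]
            # Numerically stable softmax over the speaker axis.
            peak = max(scores)
            weights = [math.exp(x - peak) for x in scores]
            total = sum(weights)
            # Weighted sum of the present speakers' embeddings.
            output[q][t] = [
                sum(w / total * features[k][t][d] for w, k in zip(weights, active))
                for d in range(dim)
            ]
    return output
```

Because the softmax runs only over the speakers marked present, the same function works for one, two, or more faces, and setting an entry of `present` to `False` mimics a missing face at inference time or a face dropped during training.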
