Plug-and-Play Co-Occurring Face Attention for Robust Audio-Visual Speaker Extraction

Zexu Pan, Shengkui Zhao, Tingting Wang, Kun Zhou, Yukun Ma, Chong Zhang, Bin Ma

TL;DR

The paper addresses audio-visual speaker extraction (AVSE) in scenes with multiple co-occurring faces by introducing a plug-and-play Inter-Speaker Attention Module (ISAM) that processes a flexible number of on-screen faces. ISAM is integrated into two AVSE backbones, AV-DPRNN and AV-TFGridNet, and trained with dropout to handle missing faces; it applies self-attention along the speaker axis, and the model is trained with an SI-SNR objective. Results on VoxCeleb2, MISP, LRS2, and LRS3 show consistent improvements in SI-SNRi and related metrics on both highly overlapped and sparsely overlapped mixtures, with larger gains when more co-occurring faces are observed and with robust cross-dataset generalization. These gains come with only ~0.2M extra parameters, which supports the module's plug-and-play use in real-world multi-person AVSE tasks.
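The SI-SNR objective and the SI-SNRi metric mentioned above can be illustrated with a minimal sketch. It uses the standard scale-invariant SNR definition on plain Python lists. The function names `si_snr` and `si_snr_improvement` and the `eps` stabilizer are assumptions made for illustration, not the authors' code.

```python
import math


def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB between two equal-length 1-D signals."""
    if len(estimate) != len(target):
        raise ValueError("signals must have the same length")
    n = len(target)
    # Remove the mean of each signal so a DC offset does not affect the score.
    est_mean = sum(estimate) / n
    tgt_mean = sum(target) / n
    est = [e - est_mean for e in estimate]
    tgt = [t - tgt_mean for t in target]
    # Project the estimate onto the target: this is the "signal" component.
    dot = sum(e * t for e, t in zip(est, tgt))
    tgt_energy = sum(t * t for t in tgt) + eps
    scale = dot / tgt_energy
    s_target = [scale * t for t in tgt]
    # Whatever is left after the projection counts as noise.
    e_noise = [e - s for e, s in zip(est, s_target)]
    signal_energy = sum(s * s for s in s_target)
    noise_energy = sum(x * x for x in e_noise)
    return 10.0 * math.log10((signal_energy + eps) / (noise_energy + eps))


def si_snr_improvement(estimate, mixture, target):
    """SI-SNRi: gain of the extracted signal over the unprocessed mixture."""
    return si_snr(estimate, target) - si_snr(mixture, target)


# Toy signals (assumed for illustration): the interferer is orthogonal to the target.
target = [1.0, -1.0, 2.0, -2.0]
interferer = [1.0, 1.0, -1.0, -1.0]
mixture = [t + i for t, i in zip(target, interferer)]
estimate = [t + 0.1 * i for t, i in zip(target, interferer)]
improvement_db = si_snr_improvement(estimate, mixture, target)
```

When SI-SNR serves as a training objective, its negative is typically minimized; SI-SNRi reports how much the extracted signal improves on the input mixture.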

Abstract

Audio-visual speaker extraction isolates a target speaker's speech from a speech mixture conditioned on a visual cue, typically using the target speaker's face recording. However, in real-world scenarios, other co-occurring faces are often present on-screen, providing valuable speaker activity cues in the scene. In this work, we introduce a plug-and-play inter-speaker attention module to process a flexible number of co-occurring faces, allowing for more accurate speaker extraction in complex multi-person environments. We integrate our module into two prominent models: the AV-DPRNN and the state-of-the-art AV-TFGridNet. Extensive experiments on diverse datasets, including the highly overlapped VoxCeleb2 and sparsely overlapped MISP, demonstrate that our approach consistently outperforms baselines. Furthermore, cross-dataset evaluations on LRS2 and LRS3 confirm the robustness and generalizability of our method.

Paper Structure

This paper contains 14 sections, 3 equations, 2 figures, 8 tables.

Figures (2)

  • Figure 1: We explore the complementary speech activity cue offered by a co-occurring on-screen face (red dotted face box) when extracting the speech of the target speaker (green non-dotted face box) in an audio-visual speaker extraction (AVSE) network.
  • Figure 2: We introduce an optional (dotted box) inter-speaker attention module (ISAM) that attends to the activities of co-occurring faces (dotted lines). The plug-and-play ISAM is easily adaptable to the majority of AVSE networks.
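The idea of attending across a variable number of faces can be sketched as follows. This is a simplified illustration, not the paper's ISAM: it runs single-head scaled dot-product self-attention along the speaker axis for one time frame, uses identity query/key/value projections instead of learned ones, and omits any residual connection or normalization. The helper `drop_co_occurring_faces` reflects one plausible reading of "trained with dropout to handle missing faces" (randomly hiding co-occurring faces during training); both function names and the convention that index 0 is the target are assumptions.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def inter_speaker_attention(embeddings, present):
    """Self-attention across the speaker axis for a single time frame.

    embeddings: list of S feature vectors, one per on-screen face
                (index 0 is assumed to be the target speaker).
    present:    list of S booleans; False marks a missing or dropped face.
    Returns S vectors; absent speakers are passed through unchanged.
    """
    dim = len(embeddings[0])
    # Only faces that are actually present may serve as keys/values.
    keys = [i for i, p in enumerate(present) if p]
    out = []
    for i, query in enumerate(embeddings):
        if not present[i] or not keys:
            out.append(list(query))
            continue
        scores = [
            sum(q * k for q, k in zip(query, embeddings[j])) / math.sqrt(dim)
            for j in keys
        ]
        weights = softmax(scores)
        # Weighted sum of the present speakers' vectors.
        attended = [
            sum(w * embeddings[j][d] for w, j in zip(weights, keys))
            for d in range(dim)
        ]
        out.append(attended)
    return out


def drop_co_occurring_faces(present, drop_prob, rng):
    """Randomly hide co-occurring faces; the target (index 0) is always kept."""
    return [
        p if i == 0 else (p and rng.random() >= drop_prob)
        for i, p in enumerate(present)
    ]


# Three on-screen faces with toy 2-D features (values assumed for illustration).
frame = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mask = drop_co_occurring_faces([True, True, True], 0.5, random.Random(0))
mixed = inter_speaker_attention(frame, mask)
```

Because the mask only removes entries from the key set, the same function handles any number of faces. In this simplified sketch, when only the target is present the attention returns the target's features unchanged, which mirrors the optional nature of the module.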