
Audio-Visual Target Speaker Extraction with Reverse Selective Auditory Attention

Ruijie Tao, Xinyuan Qian, Yidi Jiang, Junjie Li, Jiadong Wang, Haizhou Li

TL;DR

The paper tackles audio-visual target speaker extraction (AV-TSE) by introducing a subtraction-based selective auditory attention framework, SEANet, which explicitly models and suppresses noisy interference. SEANet combines a parallel speech and noise learning (PSNL) block with a reverse attention module to enforce mutual exclusivity between target speech and interference, and it extends to multi-modal variants (F-SEANet, P-SEANet, A-SEANet) to leverage lip-visual cues more effectively. Through extensive experiments on five datasets (LRS2, VoxCeleb2, LRS3, Grid, TCD-TIMIT) and nine metrics, SEANet achieves state-of-the-art performance with a lightweight model (~8.7M parameters) and demonstrates strong cross-domain robustness. The results highlight the value of treating noise as a learnable auxiliary target within AV-TSE and suggest further exploration of auxiliary references and early fusion strategies for improved real-world performance.

Abstract

Audio-visual target speaker extraction (AV-TSE) aims to extract a specific person's speech from an audio mixture given auxiliary visual cues. Previous methods usually search for the target voice through speech-lip synchronization. However, this strategy mainly focuses on the existence of the target speech while ignoring variations in the noise characteristics, i.e., interfering speakers and background noise. This may result in extracting noisy signals from the incorrect sound source in challenging acoustic situations. To this end, we propose a novel selective auditory attention mechanism, which suppresses interfering speakers and non-speech signals to avoid incorrect speaker extraction. By estimating and utilizing the undesired noisy signal through this mechanism, we design an AV-TSE framework named Subtraction-and-ExtrAction network (SEANet) to suppress the noisy signals. We conduct extensive experiments, re-implementing three popular AV-TSE methods as baselines and using nine metrics for evaluation. The experimental results show that our proposed SEANet achieves state-of-the-art results and performs well on all five datasets. The code is available at: https://github.com/TaoRuijie/SEANet.git

Paper Structure (57 sections, 8 equations, 11 figures, 11 tables)

Figures (11)

  • Figure 1: Typical approaches in AV-TSE focus on 'extraction', which searches the audio mixture for the target speaker's voice that matches the corresponding lip movements. However, the extraction results may include noisy signals from incorrect sound sources. To alleviate this problem, we introduce the complementary 'subtraction' strategy. Based on this selective auditory attention, our proposed method estimates the noisy signal and excludes it during extraction.
  • Figure 2: Available auxiliary references in AV-TSE. Green line: speech-lip synchronization (positively correlated); Blue line: voice consistency (positively correlated); Orange line: speech-noise exclusivity (negatively correlated).
  • Figure 3: The bottom panel shows the data generation process for AV-TSE; the upper panel shows our proposed SEANet, which extracts the clean speech of the target speaker from the audio mixture. Specifically, SEANet contains $R$ repeated PSNL blocks that learn the selective auditory attention between the estimated clean speech and the noisy signal. $M_{si}$ and $M_{ni}$ are the output speech and noise embeddings from the $i^{th}$ block, respectively. $\oplus$ represents the entrywise sum (used to mix the audio signals) and $\otimes$ represents matrix multiplication. Note that the Pre-extractor and Pre-suppressor share the same model architecture as the Extractor and Suppressor, respectively.
  • Figure 4: The $i^{th}$ parallel speech and noise learning (PSNL) block. Intra-att and Inter-att blocks are used to learn the interaction between the target speaker's speech and the noisy signal. $M_{si}$ and $M_{ni}$ are the output embeddings of the speech and the noise, respectively. Dashed lines denote residual connections. The pre-extractor/suppressor in Fig. 3 has the same structure as the extractor/suppressor, except that its input is $Y$ and its output is $M_{s0}$ or $M_{n0}$.
  • Figure 5: Illustration of the intra-chunk attention module (intra-att) or the inter-chunk attention module (inter-att). $F_s$ and $F_n$ represent the input speech and noise embeddings, respectively. $F^{\prime}_{s}$ and $F^{\prime}_{n}$ denote their respective outputs. This module contains the self-attention score $A_{s+}$ to learn the speech-lip relationship and the reverse attention score $A_{s-}$ to capture the speech-noise mutual exclusivity.
  • ...and 6 more figures
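
The Figure 5 caption names two attention scores, $A_{s+}$ and $A_{s-}$, but this page does not give their formulas. The NumPy sketch below shows one plausible reading, under loud assumptions: $A_{s+}$ is ordinary scaled dot-product self-attention over the speech embedding, and $A_{s-}$ is a softmax over the negated speech-noise similarity, so the noise frames most similar to the speech receive the lowest weight. The function name, the missing learned projections, and the residual combination are illustrative choices, not the authors' confirmed implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_and_reverse_attention(F_s, F_n):
    """Hypothetical helper. F_s, F_n: (T, D) speech and noise embeddings."""
    d = F_s.shape[-1]
    # A_{s+}: scaled dot-product self-attention over speech frames.
    A_pos = softmax(F_s @ F_s.T / np.sqrt(d))
    # A_{s-} (assumed form): negate the speech-noise similarity before
    # the softmax, so highly similar noise frames are down-weighted.
    A_neg = softmax(-(F_s @ F_n.T) / np.sqrt(d))
    # Assumed residual combination of both attended signals.
    F_s_out = F_s + A_pos @ F_s + A_neg @ F_n
    return F_s_out, A_pos, A_neg

rng = np.random.default_rng(0)
F_s = rng.standard_normal((6, 8))   # 6 frames, 8-dim speech embedding
F_n = rng.standard_normal((6, 8))   # 6 frames, 8-dim noise embedding
F_s_out, A_pos, A_neg = self_and_reverse_attention(F_s, F_n)
```

In this sketch each row of `A_pos` and `A_neg` is a probability distribution over frames, and the smallest entry in a row of `A_neg` sits at the noise frame with the largest dot-product similarity to that speech frame, which is the mutual-exclusivity behaviour the caption describes.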