Audio-Visual Target Speaker Extraction with Reverse Selective Auditory Attention
Ruijie Tao, Xinyuan Qian, Yidi Jiang, Junjie Li, Jiadong Wang, Haizhou Li
TL;DR
The paper tackles audio-visual target speaker extraction (AV-TSE) by introducing a subtraction-based selective auditory attention framework, SEANet, which explicitly models and suppresses noisy interference. SEANet combines a parallel speech and noise learning (PSNL) block with a reverse attention module to enforce mutual exclusivity between target speech and interference, and it extends to multi-modal variants (F-SEANet, P-SEANet, A-SEANet) to leverage lip-visual cues more effectively. Through extensive experiments on five datasets (LRS2, VoxCeleb2, LRS3, Grid, TCD-TIMIT) and nine metrics, SEANet achieves state-of-the-art performance with a lightweight model (~8.7M parameters) and demonstrates strong cross-domain robustness. The results highlight the value of treating noise as a learnable auxiliary target within AV-TSE and suggest further exploration of auxiliary references and early fusion strategies for improved real-world performance.
Abstract
Audio-visual target speaker extraction (AV-TSE) aims to extract a specific person's speech from an audio mixture given auxiliary visual cues. Previous methods usually search for the target voice through speech-lip synchronization. However, this strategy mainly focuses on the presence of the target speech while ignoring variations in the noise characteristics, i.e., interfering speakers and background noise. This may result in extracting noisy signals from the incorrect sound source in challenging acoustic situations. To this end, we propose a novel selective auditory attention mechanism that suppresses interfering speakers and non-speech signals to avoid incorrect speaker extraction. By estimating and utilizing the undesired noisy signal through this mechanism, we design an AV-TSE framework named Subtraction-and-ExtrAction network (SEANet) to suppress the noisy signals. We conduct extensive experiments, re-implementing three popular AV-TSE methods as baselines and using nine metrics for evaluation. The experimental results show that the proposed SEANet achieves state-of-the-art results and performs well on all five datasets. The code is available at: https://github.com/TaoRuijie/SEANet.git
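
To make the suppression idea more concrete, the following is a minimal toy sketch of one way a "reverse" attention weighting could be computed. It is not the SEANet implementation: the function name `reverse_attention_weights`, the use of scaled dot-product similarity, and the choice of a softmax over negated scores are all assumptions made purely for illustration. The sketch only shows the intuition that noise frames most similar to a speech frame receive the smallest weight, which pushes the two representations apart.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def reverse_attention_weights(speech_frames, noise_frames):
    """Hypothetical reverse attention (illustrative assumption, not the paper's code).

    For each speech frame, compute scaled dot-product similarity to every
    noise frame, then apply softmax to the NEGATED scores so that the most
    similar noise frame gets the lowest weight.
    """
    scale = math.sqrt(len(speech_frames[0]))
    weights = []
    for s in speech_frames:
        scores = [dot(s, n) / scale for n in noise_frames]
        weights.append(softmax([-v for v in scores]))
    return weights

# Tiny hand-made feature vectors (2 speech frames, 3 noise frames).
speech = [[1.0, 0.0], [0.0, 1.0]]
noise = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
weights = reverse_attention_weights(speech, noise)
```

In this toy setup, the first speech frame is identical to the first noise frame, so that pair receives the smallest weight in its row, while the opposite-pointing third noise frame receives the largest. A real system would operate on learned embeddings and would combine such weights with the rest of the extraction network.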
