Progressive Confident Masking Attention Network for Audio-Visual Segmentation
Yuxuan Wang, Jinchao Zhu, Feng Dong, Shuyue Zhu
TL;DR
PCMANet tackles Audio-Visual Segmentation (AVS) by integrating audio-visual cues through Audio-Visual Group Attention (AVGA) and strengthening cross-modal fusion with Query-Selected Cross-Attention (QSCA). It introduces Confidence-Induced Masking (CIM) to progressively mask low-confidence tokens and Guided Fusion (GF) to refine multi-stage predictions, reducing computation substantially while maintaining or improving segmentation quality. The paper justifies the efficiency gain both empirically and theoretically: replacing full MSA with masked QSCA that uses only $N'=rN$ selected query tokens gives $\Omega(\text{QSCA})=2NC^2+2N'C^2+2NN'C$, where the ratio $r$ can fall below $0.1$. Evaluations on the AVSBench datasets (S4, MS3, AVSS) demonstrate state-of-the-art performance with lower FLOPs and faster inference, highlighting PCMANet's practicality for real-world, edge-enabled AVS tasks.
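To make the complexity claim concrete, the following is a minimal sketch that evaluates the QSCA expression quoted above against full multi-head self-attention. It rests on assumptions not stated in this summary: $N$ is read as the number of tokens and $C$ as the channel dimension, the MSA cost uses the standard expression $4NC^2+2N^2C$, and the function names and the example values of `n`, `c`, and `r` are hypothetical.

```python
def msa_flops(n: int, c: int) -> int:
    """Standard full self-attention cost: 4NC^2 + 2N^2C (assumed baseline)."""
    return 4 * n * c**2 + 2 * n**2 * c


def qsca_flops(n: int, c: int, r: float) -> int:
    """QSCA cost from the TL;DR: 2NC^2 + 2N'C^2 + 2NN'C with N' = rN."""
    n_sel = round(r * n)  # N': number of query tokens kept after masking
    return 2 * n * c**2 + 2 * n_sel * c**2 + 2 * n * n_sel * c


if __name__ == "__main__":
    n, c = 1024, 64  # hypothetical token count and channel width
    for r in (1.0, 0.5, 0.1, 0.05):
        ratio = qsca_flops(n, c, r) / msa_flops(n, c)
        print(f"r={r:<4} QSCA/MSA cost ratio = {ratio:.3f}")
```

With $r=1$ the two expressions coincide, which is a quick sanity check on the formula. As $r$ shrinks, the term $2NC^2$ becomes the floor on the cost because it does not depend on $N'$.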
Abstract
Audio and visual signals typically occur simultaneously, and humans possess an innate ability to correlate and synchronize information from these two modalities. Recently, a challenging problem known as Audio-Visual Segmentation (AVS) has emerged, which aims to produce segmentation maps for sounding objects within a scene. However, the methods proposed so far have not sufficiently integrated audio and visual information, and their computational costs have been extremely high. Additionally, the outputs of different stages have not been fully utilized. To address these issues, we introduce a novel Progressive Confident Masking Attention Network (PCMANet). It leverages attention mechanisms to uncover the intrinsic correlations between audio signals and visual frames. Furthermore, we design an efficient and effective cross-attention module that enhances semantic perception by selecting query tokens. This selection is determined by confidence-driven units based on the network's multi-stage predictive outputs. Experiments demonstrate that our network outperforms other AVS methods while requiring fewer computational resources. The code is available at: https://github.com/PrettyPlate/PCMANet.
