Progressive Confident Masking Attention Network for Audio-Visual Segmentation

Yuxuan Wang, Jinchao Zhu, Feng Dong, Shuyue Zhu

TL;DR

PCMANet tackles Audio-Visual Segmentation (AVS) by integrating audio–visual cues through Audio-Visual Group Attention (AVGA) and enhancing cross-modal fusion with Query-Selected Cross-Attention (QSCA). It introduces Confidence-Induced Masking (CIM) to progressively mask low-confidence tokens and Guided Fusion (GF) to refine multi-stage predictions, achieving substantial computational savings while maintaining or improving segmentation quality. The paper justifies the efficiency gain both empirically and theoretically: replacing full multi-head self-attention (MSA) with masked QSCA over $N'=rN$ selected query tokens gives $\Omega(\text{QSCA})=2NC^2+2N'C^2+2NN'C$, where the ratio $r$ can fall below $0.1$. Evaluations on the AVSBench datasets (S4, MS3, AVSS) demonstrate state-of-the-art performance with lower FLOPs and faster inference, highlighting PCMANet’s practicality for real-world, edge-enabled AVS tasks.
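To get a feel for the quoted complexity, the sketch below evaluates the QSCA expression for a given ratio $r$ and compares it with a full-attention baseline. The baseline formula $4NC^2+2N^2C$ is the commonly used MSA cost and is an assumption here, since the summary does not state which baseline expression the paper uses. The token count, channel width, and function names are hypothetical, chosen only for illustration.

```python
def msa_flops(n: int, c: int) -> int:
    """Commonly used cost of full multi-head self-attention: 4NC^2 + 2N^2C.

    Assumption: this is the baseline the QSCA expression is compared against.
    """
    return 4 * n * c * c + 2 * n * n * c


def qsca_flops(n: int, c: int, r: float) -> int:
    """Quoted QSCA cost 2NC^2 + 2N'C^2 + 2NN'C, with N' = rN selected queries."""
    n_sel = int(r * n)  # N': number of query tokens that survive masking
    return 2 * n * c * c + 2 * n_sel * c * c + 2 * n * n_sel * c


# Hypothetical sizes: a 56x56 token map with 64 channels.
n, c = 3136, 64
full = msa_flops(n, c)
masked = qsca_flops(n, c, r=0.1)
print(f"MSA: {full}  QSCA (r=0.1): {masked}  ratio: {masked / full:.3f}")
```

With $r=1$ the QSCA expression reduces to the baseline formula, so the saving comes entirely from shrinking the query set: the $2N'C^2$ and $2NN'C$ terms scale with $r$, while the $2NC^2$ term does not.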

Abstract

Audio and visual signals typically occur simultaneously, and humans possess an innate ability to correlate and synchronize information from these two modalities. Recently, a challenging problem known as Audio-Visual Segmentation (AVS) has emerged, which aims to produce segmentation maps for sounding objects within a scene. However, the methods proposed so far have not sufficiently integrated audio and visual information, and their computational costs have been extremely high. Additionally, the outputs of different stages have not been fully utilized. To facilitate this research, we introduce a novel Progressive Confident Masking Attention Network (PCMANet). It leverages attention mechanisms to uncover the intrinsic correlations between audio signals and visual frames. Furthermore, we design an efficient and effective cross-attention module to enhance semantic perception by selecting query tokens. This selection is determined through confidence-driven units based on the network's multi-stage predictive outputs. Experiments demonstrate that our network outperforms other AVS methods while requiring fewer computational resources. The code is available at: https://github.com/PrettyPlate/PCMANet.

Paper Structure (23 sections, 12 equations, 11 figures, 4 tables)

Figures (11)

  • Figure 1: An example of the audio-visual segmentation task (S4 dataset). The video depicts a man playing the glockenspiel. Within the clip, only the glockenspiel is sounding; therefore, it is the only object that is labeled.
  • Figure 2: Overview of the pipeline. The network takes the video clip as input and separates it into visual frames and audio spectrograms. The outputs of the visual and audio encoders are denoted as $F_i$ and $A$, respectively. The features are first integrated by AVGAs and are further processed by QSCAs to generate guide features $G_{i}$. Meanwhile, the multi-stage outputs, which are aggregated by GFs, pass through CIMs to generate confidence masks and are then sent to QSCAs for further optimization. The bottom-right corner presents the internal processing logic of the GF module. The deepest GF, which does not receive the feature $H$ as input, is labeled as GF*.
  • Figure 3: Structure of Audio-Visual Group Attention (AVGA). It integrates audio and visual features using group attention operations. The visual features are divided into multiple groups, with each group fused with audio information through the "Attn" module. This module first normalizes the input features and then performs fusion using a dot-product operation.
  • Figure 4: Structure of Query-Selected Cross-Attention (QSCA). Right: QSCA takes audio features, visual features, and an external binary mask as inputs and outputs the integrated features. Modal information is exchanged through two multi-head cross-attention blocks, with one employing query-selected attention. Left: Visual queries are selected using gather and scatter operations based on the mask input.
  • Figure 5: An example of the QSCA and Confidence-Induced Masking (CIM). Dark and light pixel regions represent masked and valid queries, respectively. The mask is generated from the sigmoid output, followed by thresholding and binarization. Progressive masking is achieved through iterative mask multiplication.
  • ...and 6 more figures
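
The captions for Figures 4 and 5 describe two mechanics that a short example can make concrete: building a binary mask from sigmoid outputs with progressive multiplication across stages, and selecting queries by gather and scatter. The following is a minimal sketch under stated assumptions, not the authors' implementation. The threshold value, the rule that a token is a valid query when its sigmoid output exceeds the threshold, the pass-through of masked tokens, and all function names are hypothetical. The `update` callable stands in for the cross-attention block.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def binary_mask(logits, tau=0.5):
    """Sigmoid, then threshold and binarize: 1 = valid query, 0 = masked.

    Assumption: the thresholding direction and tau are illustrative only.
    """
    return [1 if sigmoid(v) > tau else 0 for v in logits]


def progressive_mask(stage_logits, tau=0.5):
    """Multiply per-stage masks, so a token masked at one stage stays masked."""
    mask = [1] * len(stage_logits[0])
    for logits in stage_logits:
        mask = [m * b for m, b in zip(mask, binary_mask(logits, tau))]
    return mask


def query_selected_update(tokens, mask, update):
    """Gather valid tokens, update only those, and scatter the results back."""
    idx = [i for i, m in enumerate(mask) if m == 1]  # gather indices
    selected = [tokens[i] for i in idx]              # the N' selected queries
    updated = update(selected)                       # stand-in for attention
    out = list(tokens)                               # masked tokens pass through
    for i, v in zip(idx, updated):                   # scatter back in place
        out[i] = v
    return out


# Two hypothetical stages of per-token prediction logits for four tokens.
stage_logits = [[2.0, -1.0, 0.5, 3.0], [1.5, 2.0, -0.2, 0.1]]
mask = progressive_mask(stage_logits)
tokens = [10.0, 20.0, 30.0, 40.0]
out = query_selected_update(tokens, mask, lambda xs: [x + 1.0 for x in xs])
print(mask, out)
```

Because the masks are multiplied, the set of valid queries can only shrink from stage to stage, which is what keeps the number of selected tokens $N'$ small in later stages.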