Robust Audio-Visual Segmentation via Audio-Guided Visual Convergent Alignment
Chen Liu, Peike Li, Liying Yang, Dadong Wang, Lincheng Li, Xin Yu
TL;DR
This paper addresses ambiguous audio-visual correspondences in audio-visual segmentation with two modules: an Audio-Guided Modality Alignment (AMA) module that groups visual features and merges them under audio guidance, and an Uncertainty Estimation (UE) module that handles frequent changes in sounding status using a Dirichlet-based uncertainty model. AMA uses Density Peaks Clustering with k-nearest neighbors (DPC-KNN) to form semantically meaningful groups, applies multi-head cross-attention to produce compact, audio-responsive representations, and is trained with a contrastive loss that separates sounding regions from silent ones. UE computes an uncertainty map from temporal attention and downweights predictions in high-uncertainty regions, which reduces mis-segmentation during sound transitions. Experiments on the AVS-Objects, AVS-Semantic, and VPO datasets show state-of-the-art performance, with the largest gains in challenging settings that involve visually similar but acoustically different objects and rapidly changing sound states.
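To make the grouping step concrete, the following is a minimal NumPy sketch of generic DPC-KNN clustering, not the authors' implementation. The function name `dpc_knn`, the parameters `num_clusters` and `k`, and the use of plain Euclidean distance on flattened visual features are assumptions made for illustration; the paper's feature extraction and audio-guided merging are not reproduced here.

```python
import numpy as np


def dpc_knn(features, num_clusters, k=5):
    """Generic DPC-KNN sketch (hypothetical helper, not the paper's code).

    features: array of shape (N, D), one row per visual token; requires k < N.
    Returns (labels, centers): a cluster id per token and the center indices.
    """
    # Pairwise Euclidean distances, shape (N, N).
    diff = features[:, None, :] - features[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))

    # Local density from the k nearest neighbours.
    # Column 0 of the sorted distances is the point itself (distance 0).
    knn_dist = np.sort(dist, axis=1)[:, 1:k + 1]
    rho = np.exp(-(knn_dist ** 2).mean(axis=1))

    # delta_i: distance to the nearest point with strictly higher density.
    higher = rho[None, :] > rho[:, None]  # higher[i, j] is True if rho_j > rho_i
    delta = np.where(higher, dist, np.inf).min(axis=1)

    # The densest point has no denser neighbour; use its largest distance.
    is_peak = ~higher.any(axis=1)
    delta[is_peak] = dist[is_peak].max(axis=1)

    # Cluster centers are dense points that lie far from any denser point.
    score = rho * delta
    centers = np.argsort(-score)[:num_clusters]

    # Every token joins the group of its nearest center.
    labels = dist[:, centers].argmin(axis=1)
    labels[centers] = np.arange(num_clusters)
    return labels, centers
```

In a pipeline like the one summarized above, the resulting `labels` would define the groups whose features are then merged under audio guidance; that merging step is left out of this sketch.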
Abstract
Accurately localizing audible objects based on audio-visual cues is the core objective of audio-visual segmentation. Most previous methods emphasize spatial or temporal multi-modal modeling, yet overlook challenges from ambiguous audio-visual correspondences such as nearby visually similar but acoustically different objects and frequent shifts in objects' sounding status. Consequently, they may struggle to reliably correlate audio and visual cues, leading to over- or under-segmentation. To address these limitations, we propose a novel framework with two primary components: an audio-guided modality alignment (AMA) module and an uncertainty estimation (UE) module. Instead of indiscriminately correlating audio-visual cues through a global attention mechanism, AMA performs audio-visual interactions within multiple groups and consolidates group features into compact representations based on their responsiveness to audio cues, effectively directing the model's attention to audio-relevant areas. Leveraging contrastive learning, AMA further distinguishes sounding regions from silent areas by treating features with strong audio responses as positive samples and weaker responses as negatives. Additionally, UE integrates spatial and temporal information to identify high-uncertainty regions caused by frequent changes in sound state, reducing prediction errors by lowering confidence in these areas. Experimental results demonstrate that our approach achieves superior accuracy compared to existing state-of-the-art methods, particularly in challenging scenarios where traditional approaches struggle to maintain reliable segmentation.
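The idea of lowering confidence in uncertain regions can be illustrated with the standard evidential (Dirichlet) formulation, in which per-class evidence e_k gives concentration alpha_k = e_k + 1 and uncertainty u = K / sum_k alpha_k. The sketch below is an assumption-laden illustration, not the paper's UE module: the names `dirichlet_uncertainty` and `downweight`, the softplus evidence function, and the multiplicative adjustment are all hypothetical choices, and the input scores stand in for whatever spatial and temporal evidence the model actually computes.

```python
import numpy as np


def dirichlet_uncertainty(scores):
    """Evidential uncertainty sketch (hypothetical helper).

    scores: array of shape (K, H, W) with raw per-class scores per pixel.
    Returns (probs, uncertainty) with shapes (K, H, W) and (H, W).
    """
    # Non-negative evidence via a numerically stable softplus (assumed choice).
    evidence = np.logaddexp(0.0, scores)
    alpha = evidence + 1.0               # Dirichlet concentration parameters
    strength = alpha.sum(axis=0)         # total evidence plus K, per pixel
    probs = alpha / strength             # expected class probabilities
    uncertainty = scores.shape[0] / strength  # u = K / S, in (0, 1]
    return probs, uncertainty


def downweight(foreground_prob, uncertainty):
    """Scale the foreground probability by (1 - u); an assumed adjustment rule."""
    return foreground_prob * (1.0 - uncertainty)
```

A pixel with strong evidence for one class gets a small u and keeps most of its predicted probability, while a pixel with almost no evidence gets u close to 1 and is suppressed.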
