Robust Audio-Visual Segmentation via Audio-Guided Visual Convergent Alignment

Chen Liu, Peike Li, Liying Yang, Dadong Wang, Lincheng Li, Xin Yu

TL;DR

Robust Audio-Visual Segmentation via Audio-Guided Visual Convergent Alignment addresses ambiguous audio-visual correspondences by introducing an Audio-Guided Modality Alignment (AMA) module that groups visual features and merges them under audio guidance, and an Uncertainty Estimation (UE) module that accounts for frequent changes in sounding status using a Dirichlet-based uncertainty model. The AMA uses Density Peaks Clustering (DPC-KNN) to form semantically meaningful groups, applies multi-head cross-attention to produce compact, audio-responsive representations, and trains with a contrastive loss to separate sounding and silent regions. The UE module computes an uncertainty map from temporal attention and adjusts predictions by downweighting high-uncertainty regions, reducing mis-segmentation during sound transitions. Empirical results on AVS-Objects, AVS-Semantic, and VPO datasets show state-of-the-art performance, with notable gains in challenging settings, demonstrating improved robustness to visually similar yet acoustically different objects and dynamic sound states.
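
The grouping step named above can be illustrated with a minimal NumPy sketch of DPC-KNN. This follows the commonly used formulation (local density from the k nearest neighbours, then a score of density times distance to the nearest denser point) and is an assumption, not the authors' implementation; the function name `dpc_knn` and the parameters `num_clusters` and `k` are hypothetical.

```python
import numpy as np

def dpc_knn(features, num_clusters, k=5):
    """Cluster (N, D) features; returns (labels (N,), center indices (num_clusters,)).

    Illustrative sketch only: assumes k < N and Euclidean distance.
    """
    # Pairwise Euclidean distances between all features.
    diff = features[:, None, :] - features[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))

    # Local density from the k nearest neighbours (column 0 is the point itself).
    knn = np.sort(dist, axis=1)[:, 1:k + 1]
    density = np.exp(-(knn ** 2).mean(axis=1))

    # Distance to the nearest point with strictly higher density.
    higher = density[None, :] > density[:, None]  # higher[i, j]: j is denser than i
    delta = np.where(higher, dist, np.inf).min(axis=1)
    delta[np.isinf(delta)] = dist.max()  # densest point(s) have no denser neighbour

    # Cluster centers are dense points that lie far from any denser point.
    score = density * delta
    centers = np.argsort(-score)[:num_clusters]

    # Assign every feature to its nearest center.
    labels = dist[:, centers].argmin(axis=1)
    labels[centers] = np.arange(num_clusters)
    return labels, centers
```

In this sketch each row of `features` would stand for one spatial location of a visual feature map, so the returned `labels` define the groups within which audio-visual interaction then takes place.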

Abstract

Accurately localizing audible objects based on audio-visual cues is the core objective of audio-visual segmentation. Most previous methods emphasize spatial or temporal multi-modal modeling, yet overlook challenges from ambiguous audio-visual correspondences such as nearby visually similar but acoustically different objects and frequent shifts in objects' sounding status. Consequently, they may struggle to reliably correlate audio and visual cues, leading to over- or under-segmentation. To address these limitations, we propose a novel framework with two primary components: an audio-guided modality alignment (AMA) module and an uncertainty estimation (UE) module. Instead of indiscriminately correlating audio-visual cues through a global attention mechanism, AMA performs audio-visual interactions within multiple groups and consolidates group features into compact representations based on their responsiveness to audio cues, effectively directing the model's attention to audio-relevant areas. Leveraging contrastive learning, AMA further distinguishes sounding regions from silent areas by treating features with strong audio responses as positive samples and weaker responses as negatives. Additionally, UE integrates spatial and temporal information to identify high-uncertainty regions caused by frequent changes in sound state, reducing prediction errors by lowering confidence in these areas. Experimental results demonstrate that our approach achieves superior accuracy compared to existing state-of-the-art methods, particularly in challenging scenarios where traditional approaches struggle to maintain reliable segmentation.
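
To make "lowering confidence in these areas" concrete, here is a hedged sketch of a Dirichlet-based uncertainty map in the style of evidential deep learning: non-negative evidence per class gives Dirichlet parameters, and uncertainty is the number of classes divided by the total Dirichlet strength. The temporal-attention input used by the UE module is not modelled here, and the down-weighting rule `prob * (1 - uncertainty)` is a hypothetical choice made for illustration.

```python
import numpy as np

def dirichlet_uncertainty(logits):
    """logits: (K, H, W) raw per-class scores. Returns (prob (K, H, W), uncertainty (H, W))."""
    # Softplus keeps the evidence non-negative; logaddexp(0, x) is a stable softplus.
    evidence = np.logaddexp(0.0, logits)
    alpha = evidence + 1.0                    # Dirichlet concentration parameters
    strength = alpha.sum(axis=0)              # total Dirichlet strength S per pixel
    prob = alpha / strength                   # expected class probabilities
    uncertainty = logits.shape[0] / strength  # u = K / S, in (0, 1]
    return prob, uncertainty

def downweight(prob, uncertainty):
    """Hypothetical adjustment: shrink confidence where uncertainty is high."""
    return prob * (1.0 - uncertainty)
```

A pixel with strong evidence for one class gets a small `uncertainty` and keeps most of its confidence, while a pixel with little evidence for any class is scaled down.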

Paper Structure

This paper contains 15 sections, 12 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: (a) Illustration of ambiguous spatio-temporal correspondences. Case (1): At time $t_1$, #Dog1 and #Dog2 are positioned closely but have differing sounding states, challenging the model to identify the genuine sounding one. Case (2): Over frames $t_1$, $t_2$, and $t_3$, #Dog1 switches between sounding and silent states, making it hard for models to reliably capture variations in the object's sounding status over time. (b) Distribution of Special Cases in AVSS Dataset. We conduct a sample analysis on a random 33.3% subset of AVSS-V2 [zhou2024audio], which reveals substantial occurrences of cases (1) and (2), indicating the frequent presence of challenging frames.
  • Figure 2: Method Overview. Our framework takes video frames $\{I_t\}_{t=1}^T$ and audio signals $\{A_t\}_{t=1}^T$ as input and predicts segmentation masks $\{\hat{Y}_t\}_{t=1}^T$ for audible objects. Visual and audio features extracted by the visual block $\mathcal{E}_{v_l}$ and audio encoder $\mathcal{E}_{a}$ are aligned through audio-guided modality alignment. The multi-scale features from each frame are then fed into the mask decoder to generate a fused feature map. Through temporal modeling, the feature maps are processed by the uncertainty estimation module to obtain the uncertainty map and mask confidence predictions. The final predicted results are generated by integrating the uncertainty map with mask confidence predictions.
  • Figure 3: (a) Image features are first grouped based on their semantic similarity. Audio and visual features interact at the group level, where features within each group are merged into compact representations guided by the audio signal. Through multiple layers of interaction, the sounding regions are progressively highlighted. The compact representations from the final layer are then used to perform contrastive learning with audio cues. (b) Guided by audio features, the features within each group merge into compact semantic representations. These grouped semantics are then remapped onto the feature map to perform the next level of alignment.
  • Figure 4: Visual comparison of challenging cases (illustrated as ❶ and ❷ in §\ref{sec:intro}) in AVSBench-Semantic. Refer to §\ref{sec:qualitative_com} for detailed analysis.
  • Figure 5: Visualization of positive (Pos.) and negative (Neg.) samples generated by the AMA module. Red indicates negative samples, while blue represents positive samples. Darker blue indicates regions with a higher responsiveness to audio cues, while darker red indicates lower responsiveness. See detailed analysis in §\ref{sec:further_analysis}.
  • ...and 1 more figure
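
The group-level merging and contrastive step described in the Figure 3 caption can be sketched roughly as follows. This is a simplified illustration under stated assumptions: a single-head, audio-weighted pooling stands in for the multi-head cross-attention, positives are taken to be the `num_pos` groups with the strongest audio response, and the loss is an InfoNCE-style ratio. All function names, `num_pos`, and `tau` are hypothetical and do not come from the paper.

```python
import numpy as np

def merge_groups(visual, labels, audio, num_groups):
    """Pool (N, D) visual features into (num_groups, D) using audio-guided weights."""
    dim = visual.shape[1]
    scores = visual @ audio / np.sqrt(dim)  # audio response of every feature
    merged = np.zeros((num_groups, dim))
    for g in range(num_groups):
        idx = np.flatnonzero(labels == g)
        if idx.size == 0:
            continue  # an empty group keeps a zero representation
        # Softmax over the group's members: audio-relevant features dominate.
        weights = np.exp(scores[idx] - scores[idx].max())
        weights /= weights.sum()
        merged[g] = weights @ visual[idx]
    return merged

def audio_response(merged, audio, eps=1e-8):
    """Cosine similarity between each merged group feature and the audio feature."""
    norms = np.linalg.norm(merged, axis=1) * np.linalg.norm(audio) + eps
    return merged @ audio / norms

def log_sum_exp(x):
    m = x.max()
    return m + np.log(np.exp(x - m).sum())

def contrastive_loss(response, num_pos=1, tau=0.1):
    """InfoNCE-style loss: strongest responses act as positives, the rest as negatives."""
    logits = response / tau
    pos = np.argsort(-logits)[:num_pos]
    return log_sum_exp(logits) - log_sum_exp(logits[pos])
```

The loss approaches zero when the positive groups respond much more strongly to the audio than the negative groups, and equals the log of the ratio of group count to `num_pos` when all groups respond equally.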