Do Audio-Visual Segmentation Models Truly Segment Sounding Objects?

Jia Li, Wenjie Zhao, Ziru Huang, Yunhui Guo, Yapeng Tian

TL;DR

The paper tackles the problem that audio-visual segmentation (AVS) models often rely on visual salience rather than true audio-visual integration, leading to incorrect predictions when audio cues are absent or irrelevant. It introduces AVSBench-Robust, a benchmark with diverse negative audio conditions (silence, ambient noise, off-screen sounds) and two evaluation splits (S4 and MS3), plus a simple debiasing framework that combines balanced positive/negative audio-visual pairs with classifier-guided similarity learning and joint segmentation. Empirical results show that state-of-the-art methods exhibit strong visual bias under negative audio, while the proposed method achieves near-perfect false positive suppression and robust segmentation on both standard AVS benchmarks and challenging negative scenarios. The work offers a practical training strategy to improve AVS reliability in real-world multimodal scenarios and highlights the importance of evaluating robustness to negative audio in multimodal perception tasks.

Abstract

Unlike traditional visual segmentation, audio-visual segmentation (AVS) requires the model not only to identify and segment objects but also to determine whether they are sound sources. Recent AVS approaches, leveraging transformer architectures and powerful foundation models like SAM, have achieved impressive performance on standard benchmarks. Yet, an important question remains: Do these models genuinely integrate audio-visual cues to segment sounding objects? In this paper, we systematically investigate this issue in the context of robust AVS. Our study reveals a fundamental bias in current methods: they tend to generate segmentation masks based predominantly on visual salience, irrespective of the audio context. This bias results in unreliable predictions when sounds are absent or irrelevant. To address this challenge, we introduce AVSBench-Robust, a comprehensive benchmark incorporating diverse negative audio scenarios including silence, ambient noise, and off-screen sounds. We also propose a simple yet effective approach combining balanced training with negative samples and classifier-guided similarity learning. Our extensive experiments show that state-of-the-art AVS methods consistently fail under negative audio conditions, demonstrating the prevalence of visual bias. In contrast, our approach achieves remarkable improvements in both standard metrics and robustness measures, maintaining near-perfect false positive rates while preserving high-quality segmentation performance.
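
As a rough illustration of the negative audio scenarios and balanced training mentioned in the abstract, the sketch below builds one negative pair (silence, noise, or an unrelated clip standing in for an off-screen sound) for every positive pair and assigns it an empty target mask. The function names, the 1:1 ratio, the noise level, and the dictionary layout are all assumptions made for this example; the paper's actual data pipeline is not described in this section.

```python
import random

# Hypothetical names for the three negative conditions listed in the abstract.
NEGATIVE_KINDS = ("silence", "noise", "offscreen")


def make_negative_audio(waveform, kind, other_waveform=None, rng=None, noise_std=0.01):
    """Return a negative-audio variant with the same length as `waveform`."""
    if rng is None:
        rng = random.Random(0)
    n = len(waveform)
    if kind == "silence":
        return [0.0] * n
    if kind == "noise":
        # Gaussian noise as a stand-in for ambient noise (assumed level).
        return [rng.gauss(0.0, noise_std) for _ in range(n)]
    if kind == "offscreen":
        if not other_waveform:
            raise ValueError("offscreen negatives need audio from another clip")
        # Loop or crop the unrelated clip so the lengths match.
        return [other_waveform[i % len(other_waveform)] for i in range(n)]
    raise ValueError(f"unknown negative kind: {kind}")


def empty_mask(height, width):
    """Target for a negative pair: nothing should be segmented."""
    return [[0] * width for _ in range(height)]


def balanced_pairs(samples, rng=None):
    """Emit one positive and one negative pair per sample (assumed 1:1 ratio)."""
    if rng is None:
        rng = random.Random(0)
    pairs = []
    for i, s in enumerate(samples):
        pairs.append({"frame": s["frame"], "audio": s["audio"],
                      "mask": s["mask"], "label": 1})
        kind = rng.choice(NEGATIVE_KINDS)
        other = None
        if kind == "offscreen":
            others = [t for j, t in enumerate(samples) if j != i]
            if others:
                # Simplification: any other clip is treated as unrelated audio.
                other = rng.choice(others)["audio"]
            else:
                kind = "silence"
        h, w = len(s["mask"]), len(s["mask"][0])
        pairs.append({"frame": s["frame"],
                      "audio": make_negative_audio(s["audio"], kind, other, rng),
                      "mask": empty_mask(h, w), "label": 0})
    return pairs
```

Pairing each frame with both its original audio and a negative variant means the visual input alone cannot predict the target mask, which is the property a visually biased model would otherwise exploit.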

Paper Structure

This paper contains 18 sections, 14 equations, 9 figures, 9 tables.

Figures (9)

  • Figure 1: Performance in Different Audio Scenarios. The top row shows an ambulance image under different audio conditions: ambulance sound (positive), and silence, noise, and off-screen sounds (negative). Each subsequent row displays the segmentation output of various SOTA AVS models (zhou2022audio; gao2024avsegformer; chen2024cavp) and our model under each audio condition. In negative scenarios, existing models segment the ambulance due to "visual prior" bias, mistakenly associating it with unrelated audio. In contrast, our model segments the ambulance only in the presence of relevant audio, demonstrating improved alignment between audio cues and visual segmentation.
  • Figure 2: Framework Overview. Given video frames and an audio clip as inputs, our approach can robustly identify and segment sounding objects in video frames. Positive audio-visual pairs represent aligned sound sources, while negative pairs, such as silence or offscreen sounds, correspond to empty masks. The model uses separate visual and audio encoders to extract modality-specific features, applies similarity-based alignment optimized with classifier guidance in a contrastive manner, and integrates features through a fusion module. Positive pairs maximize similarity, while negative pairs minimize it, using a small portion (10%) of the dataset for improved boundary delineation. This dual-stream design facilitates segmentation by distinguishing sound-relevant regions in complex scenes.
  • Figure 3: Performance comparison of different AVS models under various audio conditions on the Robust-S4 dataset. Existing SOTA methods (liu2024annofree; ma2024stepping; chen2024cavp) segment objects primarily based on visual salience, exhibiting a strong visual bias. In contrast, our approach achieves accurate segmentation with the original audio while successfully rejecting predictions in negative scenarios (e.g., silence, noise, off-screen).
  • Figure 4: Cosine similarity distributions between paired features before and after training. (a) Before training, positive and negative pairs exhibit similar distributions, indicating the model's limited ability to distinguish audio-visual correspondence. (b) After training with classifier-guided similarity learning, the distributions are well separated, demonstrating the model's enhanced capability to identify valid audio-visual pairs.
  • Figure 5: Performance comparison of different AVS models under various audio conditions on the Robust-MS3 dataset. Existing SOTA methods (liu2024annofree; ma2024stepping; chen2024cavp) segment objects primarily based on visual salience, exhibiting a strong visual bias. In contrast, our approach achieves accurate segmentation with the original audio while successfully rejecting predictions in negative scenarios (e.g., silence, noise, off-screen).
  • ...and 4 more figures
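
The captions for Figures 2 and 4 describe a similarity-based alignment in which positive audio-visual pairs are pushed toward high cosine similarity and negative pairs toward low similarity. A minimal sketch of that idea follows, assuming a logistic classifier on the scaled cosine similarity trained with binary cross-entropy. The loss form, the `scale` parameter, and the function names are assumptions for illustration and are not taken from the paper.

```python
import math


def cosine_similarity(a, b, eps=1e-8):
    """Cosine similarity between two feature vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # eps guards against division by zero, e.g. for all-zero features.
    return dot / max(norm_a * norm_b, eps)


def pair_loss(audio_feat, visual_feat, is_positive, scale=5.0):
    """Binary cross-entropy on a sigmoid of the scaled cosine similarity.

    Assumed loss form: minimizing it raises similarity for positive pairs
    and lowers it for negative pairs.
    """
    sim = cosine_similarity(audio_feat, visual_feat)
    prob = 1.0 / (1.0 + math.exp(-scale * sim))
    prob = min(max(prob, 1e-7), 1.0 - 1e-7)  # clamp for numerical stability
    return -math.log(prob) if is_positive else -math.log(1.0 - prob)
```

With this loss, an aligned feature pair is cheap when labeled positive and expensive when labeled negative, so training separates the two similarity distributions in the way Figure 4(b) depicts.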