Unveiling Visual Biases in Audio-Visual Localization Benchmarks

Liangyu Chen; Zihao Yue; Boshen Xu; Qin Jin

TL;DR

The paper identifies a pervasive visual bias in audio-visual source localization benchmarks, where sounding objects can often be inferred from visual context alone. It systematically analyzes two benchmarks, VGG-SS and Epic-Sounding-Object, using vision-only models (MiniGPT-v2 and HOID) to demonstrate that audio information is frequently unnecessary for correct localization. The findings show that visual cues enable accurate localization in the majority of cases and that vision-only methods can surpass state-of-the-art AVSL baselines, underscoring a bias that undermines benchmark validity. The work advocates benchmark refinement, such as data filtering or redesigned evaluation protocols, to ensure AVSL benchmarks truly assess audio-visual integration rather than vision priors.

Abstract

Audio-Visual Source Localization (AVSL) aims to localize the source of sound within a video. In this paper, we identify a significant issue in existing benchmarks: the sounding objects are often easily recognized based solely on visual cues, which we refer to as visual bias. Such biases hinder these benchmarks from effectively evaluating AVSL models. To further validate our hypothesis regarding visual biases, we examine two representative AVSL benchmarks, VGG-SS and Epic-Sounding-Object, where the vision-only models outperform all audio-visual baselines. Our findings suggest that existing AVSL benchmarks need further refinement to facilitate audio-visual learning.
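
The data-filtering refinement mentioned in the TL;DR could be sketched roughly as follows. This is a minimal illustration, not the authors' procedure: it assumes each sample has one ground-truth bounding box and one box predicted by a vision-only model, and it treats a sample as visually biased when the two boxes overlap with IoU at or above a threshold. The function names, the box format, and the 0.5 threshold are all hypothetical choices made for this example; the benchmarks' own metrics may differ.

```python
from typing import Dict, List, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def split_by_visual_bias(
    vision_only_preds: Dict[str, Box],
    ground_truth: Dict[str, Box],
    iou_threshold: float = 0.5,  # hypothetical cutoff, not taken from the paper
) -> Tuple[List[str], List[str]]:
    """Split sample ids into (visually_biased, retained).

    A sample counts as visually biased when the vision-only prediction
    already localizes the annotated sound source without using audio.
    """
    biased: List[str] = []
    retained: List[str] = []
    for sample_id, gt_box in ground_truth.items():
        pred = vision_only_preds.get(sample_id)
        if pred is not None and box_iou(pred, gt_box) >= iou_threshold:
            biased.append(sample_id)
        else:
            retained.append(sample_id)
    return biased, retained
```

Under these assumptions, the retained subset contains the samples that a vision-only model fails on, which is where audio is more likely to be needed; the fraction of biased samples gives a rough measure of how much of a benchmark can be solved from vision alone.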

Paper Structure (10 sections, 4 figures, 2 tables)

This paper contains 10 sections, 4 figures, 2 tables.

Introduction
Related Works
VGG-SS: Sounding Bias from Visual Common Sense
  Observation and Analysis
  Experiments
Epic-Sounding-Object: Sounding Bias from Hand-Object Interaction
  Observation and Analysis
  Experiments
Discussion
Conclusion

Figures (4)

Figure 1: Illustration of visual biases in AVSL benchmarks. In randomly sampled sets of 300 videos each from VGG-SS and Epic-Sounding-Object, over 90% of sound sources can be accurately identified using only visual information.
Figure 2: Qualitative results on VGG-SS. Top: successful cases; Bottom: failure cases.
Figure 3: Qualitative results on multi-source audio-visual localization data.
Figure 4: Qualitative results on Epic-Sounding-Object. Top: successful cases; Bottom: failure cases.
