Unveiling Visual Biases in Audio-Visual Localization Benchmarks
Liangyu Chen, Zihao Yue, Boshen Xu, Qin Jin
TL;DR
The paper identifies a pervasive visual bias in audio-visual source localization benchmarks, where sounding objects can often be inferred from visual context alone. It systematically analyzes two benchmarks, VGG-SS and Epic-Sounding-Object, using vision-only models (MiniGPT-v2 and HOID) to demonstrate that audio information is frequently unnecessary for correct localization. The findings show that visual cues enable accurate localization in the majority of cases and that vision-only methods can surpass state-of-the-art AVSL baselines, underscoring a bias that undermines benchmark validity. The work advocates benchmark refinement, such as data filtering or redesigned evaluation protocols, to ensure AVSL benchmarks truly assess audio-visual integration rather than vision priors.
Abstract
Audio-Visual Source Localization (AVSL) aims to localize the source of sound within a video. In this paper, we identify a significant issue in existing benchmarks: the sounding objects can often be recognized from visual cues alone, which we refer to as visual bias. Such biases prevent these benchmarks from effectively evaluating AVSL models. To validate this hypothesis, we examine two representative AVSL benchmarks, VGG-SS and Epic-Sounding-Object, on which vision-only models outperform all audio-visual baselines. Our findings suggest that existing AVSL benchmarks need further refinement to facilitate audio-visual learning.
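The comparison described in the abstract can be illustrated with a minimal sketch: score a vision-only predictor and an audio-visual predictor against the same ground-truth boxes and compare their success rates. The box format, the 0.5 IoU threshold, the function names, and the toy data below are all assumptions made for illustration; they are not the paper's evaluation code or results.

```python
from typing import Dict, Tuple

# Assumed box format: (x1, y1, x2, y2) in pixels.
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def success_rate(preds: Dict[str, Box], gts: Dict[str, Box], thr: float = 0.5) -> float:
    """Fraction of samples whose predicted box reaches IoU >= thr (threshold is an assumption)."""
    hits = sum(box_iou(preds[k], gts[k]) >= thr for k in gts)
    return hits / len(gts)


# Toy, made-up data: two samples with ground-truth sounding-object boxes.
gts = {"a": (0, 0, 10, 10), "b": (5, 5, 15, 15)}
vision_only = {"a": (0, 0, 10, 10), "b": (5, 5, 15, 14)}     # hypothetical vision-only output
audio_visual = {"a": (0, 0, 10, 10), "b": (20, 20, 30, 30)}  # hypothetical AVSL output

vision_score = success_rate(vision_only, gts)
av_score = success_rate(audio_visual, gts)
```

If a predictor that never sees the audio track scores as well as or better than the audio-visual one on such a metric, the benchmark is not measuring audio-visual integration, which is the visual bias the abstract describes.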
