Unveiling Visual Biases in Audio-Visual Localization Benchmarks

Liangyu Chen, Zihao Yue, Boshen Xu, Qin Jin

TL;DR

The paper identifies a pervasive visual bias in audio-visual source localization benchmarks, where sounding objects can often be inferred from visual context alone. It systematically analyzes two benchmarks, VGG-SS and Epic-Sounding-Object, using vision-only models (MiniGPT-v2 and HOID) to demonstrate that audio information is frequently unnecessary for correct localization. The findings show that visual cues enable accurate localization in the majority of cases and that vision-only methods can surpass state-of-the-art AVSL baselines, underscoring a bias that undermines benchmark validity. The work advocates benchmark refinement, such as data filtering or redesigned evaluation protocols, to ensure AVSL benchmarks truly assess audio-visual integration rather than vision priors.

Abstract

Audio-Visual Source Localization (AVSL) aims to localize the source of sound within a video. In this paper, we identify a significant issue in existing benchmarks: the sounding objects are often easily recognized based solely on visual cues, which we refer to as visual bias. Such biases hinder these benchmarks from effectively evaluating AVSL models. To further validate our hypothesis regarding visual biases, we examine two representative AVSL benchmarks, VGG-SS and Epic-Sounding-Object, where the vision-only models outperform all audio-visual baselines. Our findings suggest that existing AVSL benchmarks need further refinement to facilitate audio-visual learning.

Paper Structure (10 sections, 4 figures, 2 tables)

Figures (4)

  • Figure 1: Illustration of visual biases in AVSL benchmarks. In a randomly sampled set of 300 videos from each of VGG-SS and Epic-Sounding-Object, over 90% of sound sources can be accurately identified using only visual information.
  • Figure 2: Qualitative results on VGG-SS. Top: successful cases; Bottom: failure cases.
  • Figure 3: Qualitative results on multi-source audio-visual localization data.
  • Figure 4: Qualitative results on Epic-Sounding-Object. Top: successful cases; Bottom: failure cases.