VGGSounder: Audio-Visual Evaluations for Foundation Models

Daniil Zverev, Thaddäus Wiedemer, Ameya Prabhu, Matthias Bethge, Wieland Brendel, A. Sophia Koepke

TL;DR

We address the problem of modality-aware evaluation for audio-visual foundation models by expanding VGGSound into a multi-label, modality-annotated benchmark. VGGSounder adds per-label modality annotations, meta-labels (background music, voice-over, static images), and synonym/superclass expansions via a hybrid human+automatic annotation pipeline. A new modality-confusion metric $\mu$ reveals that many models are distracted by an extra modality and that embedding models rely more on audio while foundation models lean on vision. The benchmark enables richer, more reliable profiling of audio-visual understanding and provides a resource to guide mitigation strategies and future benchmark design.

Abstract

The emergence of audio-visual foundation models underscores the importance of reliably assessing their multi-modal understanding. The VGGSound dataset is commonly used as a benchmark for evaluating audio-visual classification. However, our analysis identifies several limitations of VGGSound, including incomplete labelling, partially overlapping classes, and misaligned modalities. These lead to distorted evaluations of auditory and visual capabilities. To address these limitations, we introduce VGGSounder, a comprehensively re-annotated, multi-label test set that extends VGGSound and is specifically designed to evaluate audio-visual foundation models. VGGSounder features detailed modality annotations, enabling precise analyses of modality-specific performance. Furthermore, we reveal model limitations by analysing performance degradation when adding another input modality with our new modality confusion metric.
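
The abstract does not give the formula for the modality confusion metric. As a minimal sketch, assume it is the fraction of test samples that a model classifies correctly from a single modality but misclassifies once the second modality is added. The function name `modality_confusion` and the per-sample correctness flags below are hypothetical and are not taken from the paper or its code.

```python
from typing import Sequence


def modality_confusion(correct_single: Sequence[bool],
                       correct_both: Sequence[bool]) -> float:
    """Assumed definition: share of samples that are correct with one
    modality alone but incorrect with audio and video together."""
    if len(correct_single) != len(correct_both):
        raise ValueError("inputs must have the same length")
    if len(correct_single) == 0:
        raise ValueError("need at least one sample")
    # Count samples where adding the second modality breaks a correct prediction.
    confused = sum(1 for single, both in zip(correct_single, correct_both)
                   if single and not both)
    return confused / len(correct_single)


# Hypothetical correctness flags for five test clips (illustrative only).
audio_only = [True, True, False, True, False]
audio_visual = [True, False, False, False, True]

mu_audio = modality_confusion(audio_only, audio_visual)
print(mu_audio)  # 0.4: two of five clips are lost when video is added
```

Under this reading, a value of 0 means the extra modality never hurts, and larger values mean the model is distracted by it more often. Swapping in video-only flags for `correct_single` would give the corresponding score for the visual modality.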

Paper Structure

This paper contains 29 sections, 1 equation, 11 figures, 18 tables.

Figures (11)

  • Figure 1: We introduce VGGSounder, a multi-label audio-visual classification benchmark with modality annotations. We extend the original VGGSound test set with human-annotated audible, visible, and visible$+$audible labels. We add meta labels for common confounders, such as background music. We benchmark eleven recent audio-visual models on VGGSounder. It enables selective analysis of a model’s auditory and visual capabilities on classes relevant for the queried modality.
  • Figure 2: Limitations of VGGSound. We show video frames from videos in the VGGSound test set along with their annotated label (grey) to demonstrate various limitations. A. VGGSound samples are labelled with a single class, yet many videos contain multiple distinct classes. B. Additionally, many classes partially overlap or are ambiguous. C. Some samples are labelled with classes that are not present in one of the modalities (i.e., the labelled class is not visible or audible).
  • Figure 3: Overview of VGGSounder. A. Most samples contain more than one label. B. More than a quarter of labels are audible but not visible. In contrast, only a tiny fraction is visible but not audible. C. Speech and bird sounds are the most common classes; more details can be found in the section on class label frequency in VGGSounder. D. Forty percent of the samples contain some combination of background music, voice-over, and static image(s), making the classification task significantly harder.
  • Figure 4: VGGSounder more accurately captures model performance across input modalities. We show the Hit score on VGGSounder and accuracy on VGGSound, normalised by the per-model maximum performance on each benchmark. Specifically for foundation models, we observe a significant difference in performance between VGGSound and VGGSounder.
  • Figure 5: Interface used to annotate the gold standard set in-house.
  • ...and 6 more figures