VGGSounder: Audio-Visual Evaluations for Foundation Models
Daniil Zverev, Thaddäus Wiedemer, Ameya Prabhu, Matthias Bethge, Wieland Brendel, A. Sophia Koepke
TL;DR
We address the problem of modality-aware evaluation for audio-visual foundation models by expanding VGGSound into a multi-label, modality-annotated benchmark. VGGSounder adds per-label modality annotations, meta-labels (background music, voice-over, static images), and synonym/superclass expansions via a hybrid human+automatic annotation pipeline. A new modality-confusion metric $\mu$ reveals that many models are distracted by an extra modality and that embedding models rely more on audio while foundation models lean on vision. The benchmark enables richer, more reliable profiling of audio-visual understanding and provides a resource to guide mitigation strategies and future benchmark design.
Abstract
The emergence of audio-visual foundation models underscores the importance of reliably assessing their multi-modal understanding. The VGGSound dataset is commonly used as a benchmark for evaluating audio-visual classification. However, our analysis identifies several limitations of VGGSound, including incomplete labelling, partially overlapping classes, and misaligned modalities. These limitations distort evaluations of auditory and visual capabilities. To address them, we introduce VGGSounder, a comprehensively re-annotated, multi-label test set that extends VGGSound and is specifically designed to evaluate audio-visual foundation models. VGGSounder features detailed modality annotations, enabling precise analyses of modality-specific performance. Furthermore, our new modality confusion metric reveals model limitations by measuring the performance degradation caused by adding another input modality.
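To make the idea of modality confusion concrete, the following is a minimal sketch of one plausible formulation. The abstract does not define the metric, so this is not the paper's definition. It assumes that confusion is the percentage of samples a model classifies correctly from a single modality but misclassifies once the second modality is added, and that per-sample correctness has already been reduced to a boolean (how that is done for multi-label predictions is left open here). The function name `modality_confusion` and its arguments are hypothetical.

```python
from typing import Sequence


def modality_confusion(
    correct_unimodal: Sequence[bool],
    correct_multimodal: Sequence[bool],
) -> float:
    """Percentage of samples solved with one modality but missed with both.

    Assumed formulation for illustration only: a sample counts as
    "confused" if the prediction from a single modality (e.g. audio only)
    is correct while the prediction from both modalities is wrong.
    """
    if len(correct_unimodal) != len(correct_multimodal):
        raise ValueError("Both inputs must cover the same samples.")
    if len(correct_unimodal) == 0:
        return 0.0

    # Count samples that flip from correct to incorrect when the
    # second modality is added.
    confused = sum(
        uni and not multi
        for uni, multi in zip(correct_unimodal, correct_multimodal)
    )
    return 100.0 * confused / len(correct_unimodal)


# Toy example with four samples: two of them are correct with audio
# alone but wrong once video is added.
audio_only = [True, True, False, True]
audio_visual = [True, False, False, False]
print(modality_confusion(audio_only, audio_visual))  # 50.0
```

Under this assumption, the same function would be applied once per starting modality (audio only versus audio-visual, and video only versus audio-visual), giving one confusion value for each direction.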
