DAVE: Diagnostic benchmark for Audio Visual Evaluation
Gorjan Radevski, Teodora Popordanoska, Matthew B. Blaschko, Tinne Tuytelaars
TL;DR
DAVE introduces a diagnostic benchmark for audio-visual understanding in AV-LLMs to counter visual bias in prior datasets and enable fine-grained, cross-modal evaluation. The dataset combines egocentric video, controlled overlaid sounds, and a three-task AVQA framework plus three atomic subtasks, with a semi-automatic data generation pipeline and stringent quality filtering. Across diverse models, humans substantially outperform current AV-LLMs, which struggle with sound absence and discrimination and rely heavily on visual cues for action recognition; ablations show genuine multimodal integration is required. The work highlights the need for explicit temporal alignment mechanisms and training objectives that promote mismatch detection and robust cross-modal reasoning, positioning DAVE as a valuable diagnostic tool for future AV-LLMs.
Abstract
Audio-visual understanding is a rapidly evolving field that seeks to integrate and interpret information from both auditory and visual modalities. Despite recent advances in multi-modal learning, existing benchmarks often suffer from strong visual bias, where answers can be inferred from visual data alone, and provide only aggregate scores that conflate multiple sources of error. This makes it difficult to determine whether models struggle with visual understanding, audio interpretation, or audio-visual alignment. In this work, we introduce DAVE: Diagnostic Audio Visual Evaluation, a novel benchmark dataset designed to systematically evaluate audio-visual models across controlled settings. DAVE alleviates existing limitations by (i) ensuring both modalities are necessary to answer correctly and (ii) decoupling evaluation into atomic subcategories. Our detailed analysis of state-of-the-art models reveals specific failure modes and provides targeted insights for improvement. By offering this standardized diagnostic framework, we aim to facilitate more robust development of audio-visual models.
Dataset: https://huggingface.co/datasets/gorjanradevski/dave
Code: https://github.com/gorjanradevski/dave
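To illustrate why a single aggregate score can hide the source of error, the following minimal sketch computes accuracy both overall and per subcategory. It is not the DAVE evaluation code: the record keys (`category`, `correct`), the category names, and the toy records are all hypothetical and are not results from the paper.

```python
from collections import defaultdict


def accuracy_by_category(records):
    """Return (overall_accuracy, {category: accuracy}).

    Each record is a dict with the hypothetical keys "category" (str)
    and "correct" (bool). Assumes at least one record is given.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for record in records:
        totals[record["category"]] += 1
        hits[record["category"]] += int(record["correct"])
    overall = sum(hits.values()) / sum(totals.values())
    per_category = {c: hits[c] / totals[c] for c in totals}
    return overall, per_category


# Illustrative toy records only; category names are assumptions.
toy_records = [
    {"category": "visual", "correct": True},
    {"category": "visual", "correct": True},
    {"category": "audio", "correct": False},
    {"category": "audio", "correct": True},
    {"category": "alignment", "correct": False},
    {"category": "alignment", "correct": False},
]

overall, per_category = accuracy_by_category(toy_records)
print(overall)
print(per_category)
```

In this toy case the aggregate score alone does not show that every failure falls in two of the three categories, which is the kind of conflation a decoupled, per-subcategory evaluation is meant to expose.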
