DAVE: Diagnostic benchmark for Audio Visual Evaluation

Gorjan Radevski, Teodora Popordanoska, Matthew B. Blaschko, Tinne Tuytelaars

TL;DR

DAVE introduces a diagnostic benchmark for audio-visual understanding in AV-LLMs to counter visual bias in prior datasets and enable fine-grained, cross-modal evaluation. The dataset combines egocentric video, controlled overlaid sounds, and a three-task AVQA framework plus three atomic subtasks, with a semi-automatic data generation pipeline and stringent quality filtering. Across diverse models, humans substantially outperform current AV-LLMs, which struggle with sound absence and discrimination and rely heavily on visual cues for action recognition; ablations show genuine multimodal integration is required. The work highlights the need for explicit temporal alignment mechanisms and training objectives that promote mismatch detection and robust cross-modal reasoning, positioning DAVE as a valuable diagnostic tool for future AV-LLMs.

Abstract

Audio-visual understanding is a rapidly evolving field that seeks to integrate and interpret information from both auditory and visual modalities. Despite recent advances in multi-modal learning, existing benchmarks often suffer from strong visual bias -- when answers can be inferred from visual data alone -- and provide only aggregate scores that conflate multiple sources of error. This makes it difficult to determine whether models struggle with visual understanding, audio interpretation, or audio-visual alignment. In this work, we introduce DAVE: Diagnostic Audio Visual Evaluation, a novel benchmark dataset designed to systematically evaluate audio-visual models across controlled settings. DAVE alleviates existing limitations by (i) ensuring both modalities are necessary to answer correctly and (ii) decoupling evaluation into atomic subcategories. Our detailed analysis of state-of-the-art models reveals specific failure modes and provides targeted insights for improvement. By offering this standardized diagnostic framework, we aim to facilitate more robust development of audio-visual models. Dataset: https://huggingface.co/datasets/gorjanradevski/dave Code: https://github.com/gorjanradevski/dave

Paper Structure

This paper contains 37 sections, 7 figures, 10 tables.

Figures (7)

  • Figure 1: Existing benchmarks (e.g., AVQA [yang2022avqa]) suffer from visual bias (left), while DAVE (right) contains questions that are impossible to solve without both modalities.
  • Figure 2: Performance comparison of AV-LLMs on AVQA [yang2022avqa] with different modalities. AVQA's questions exhibit a strong visual bias: performance with Video Only and with Audio+Video input is consistently similar. Error bars represent standard deviations.
  • Figure 3: Dataset statistics. Left: Overview of the number of samples in DAVE, consisting of 2426 samples across three tasks and two source datasets. Middle: The 10 most common scenarios (around 78% of the data) in our benchmark. Right: Distribution of labels.
  • Figure 4: Illustration of the multimodal tasks in DAVE. Left: Multimodal synchronisation tests whether models can correctly identify actions occurring simultaneously with a specific sound (e.g., siren). Center: Sound absence detection evaluates whether models can recognize when a queried sound (e.g., car horn) is absent. Right: Sound discrimination assesses whether models can distinguish between different sound types (e.g., dog and coughing sounds) and avoid incorrect associations. Each task is multiple-choice (the correct answer is marked with a dashed line).
  • Figure 5: Left: Model performance on DAVE's composite task vs. atomic component tasks. We report accuracy (%) on the primary multimodal synchronisation task alongside performance on the constituent capabilities: temporal ordering, audio classification, and action recognition. This analysis reveals whether failures stem from weak component capabilities or true integration challenges (see Table [tab:tasks]). Right: Impact of modality availability on DAVE performance. We report accuracy when models have access to different modality combinations: full multimodal input (Audio + Video + Text), Video + Text, and Audio + Text. The performance degradation without all modalities demonstrates DAVE's effectiveness at requiring genuine cross-modal reasoning (see Table [tab:modalities]).
  • ...and 2 more figures
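
The modality comparisons described in Figures 2 and 5 can be illustrated with a minimal sketch. It assumes each evaluation record is a hypothetical `(modality_setting, predicted, gold)` triple; the record layout, the setting names, the helper names, and the toy values are all illustrative assumptions, not the authors' evaluation code or data. A small gap between full-input accuracy and video-only accuracy would suggest visual bias, while a large gap would suggest that the audio is actually needed.

```python
from collections import defaultdict


def accuracy_by_modality(records):
    """Return accuracy per modality setting.

    `records` is an iterable of (modality_setting, predicted, gold) triples.
    This layout is a hypothetical one chosen for illustration.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for setting, predicted, gold in records:
        total[setting] += 1
        correct[setting] += int(predicted == gold)
    return {setting: correct[setting] / total[setting] for setting in total}


def visual_bias_gap(acc, full="audio+video", video_only="video"):
    """Accuracy gained by adding audio to video (hypothetical helper).

    A gap near zero means the questions are answerable from video alone.
    """
    return acc[full] - acc[video_only]


# Toy, made-up predictions (not results from the paper).
records = [
    ("audio+video", "A", "A"),
    ("audio+video", "B", "B"),
    ("audio+video", "C", "D"),
    ("audio+video", "A", "A"),
    ("video", "A", "A"),
    ("video", "C", "B"),
    ("video", "C", "D"),
    ("video", "B", "A"),
]

acc = accuracy_by_modality(records)
gap = visual_bias_gap(acc)
print(acc, gap)
```

On these toy records, the full-input setting answers three of four questions correctly and the video-only setting answers one of four, so the gap is 0.5.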