AVHBench: A Cross-Modal Hallucination Benchmark for Audio-Visual Large Language Models

Kim Sung-Bin, Oh Hyun-Bin, JungMok Lee, Arda Senocak, Joon Son Chung, Tae-Hyun Oh

TL;DR

AVHBench is a benchmark dedicated to cross-modal hallucination in audio-visual LLMs (AV-LLMs), filling a gap left by existing benchmarks that focus on a single modality. The authors build a semi-automatic dataset construction pipeline that produces four tasks (audio-driven video hallucination, video-driven audio hallucination, audio-visual matching, and audio-visual captioning) from real and synthetic samples to probe grounding and reasoning. Evaluating six contemporary AV-LLMs, the study finds substantial cross-modal hallucination, especially under multimodal inputs, and shows that simple training on an annotation-enriched AVHBench dataset, combining audio feature alignment with LoRA fine-tuning, can improve robustness. AVHBench thus serves both as an evaluation benchmark and as a training signal for more reliable audio-visual grounding.

Abstract

Following the success of Large Language Models (LLMs), expanding their boundaries to new modalities represents a significant paradigm shift in multimodal understanding. Human perception is inherently multimodal, relying not only on text but also on auditory and visual cues for a complete understanding of the world. In recognition of this fact, audio-visual LLMs have recently emerged. Despite promising developments, the lack of dedicated benchmarks poses challenges for understanding and evaluating models. In this work, we show that audio-visual LLMs struggle to discern subtle relationships between audio and visual signals, leading to hallucinations and highlighting the need for reliable benchmarks. To address this, we introduce AVHBench, the first comprehensive benchmark specifically designed to evaluate the perception and comprehension capabilities of audio-visual LLMs. Our benchmark includes tests for assessing hallucinations, as well as the cross-modal matching and reasoning abilities of these models. Our results reveal that most existing audio-visual LLMs struggle with hallucinations caused by cross-interactions between modalities, due to their limited capacity to perceive complex multimodal signals and their relationships. Additionally, we demonstrate that simple training with our AVHBench improves robustness of audio-visual LLMs against hallucinations. Dataset: https://github.com/kaist-ami/AVHBench

Paper Structure

This paper contains 52 sections, 14 figures, 10 tables.

Figures (14)

  • Figure 1: Cross-modal hallucinations in audio-visual LLMs. Audio-visual LLMs tend to hallucinate by perceiving non-existent sounds when presented with visual signals (blue event) or imagining non-existent visual events when given an audio signal (orange event), as illustrated on the left. To comprehensively assess these phenomena, we propose AVHBench (depicted on the right), comprising 4 different tasks including the judgment (J) and description (D) tasks. The red objects/event+object in the questions are queried to evaluate the model's perception and robustness against audio-visual hallucinations.
  • Figure 2: Dataset statistics. Our AVHBench dataset comprises 2,136 videos featuring 4 different tasks, including 3 judgment tasks and 1 description task. A$\rightarrow$V, V$\rightarrow$A, A-V Mat., and A-V Cap. denote Audio-driven Video Hallucination, Video-driven Audio Hallucination, Audio-visual Matching, and Audio-visual Captioning respectively. Our dataset contains a total of 5,302 QnA pairs, evenly distributed between yes and no answers, along with 1,106 audio-visual captions.
  • Figure 3: Dataset construction pipeline. To design a comprehensive hallucination benchmark, we devise a dataset construction pipeline with automated procedures, consisting of two main stages: Stage 1 involves disentangling audio-visual information, and Stage 2 focuses on Question-and-Answer (QnA) generation for four different tasks. At the end of each stage, we verify and correct the automatically generated outputs by employing a minimal number of human annotators.
  • Figure 4: Input modality types.
  • Figure 5: Qualitative results. We illustrate visible and audible objects/events in the video (on the left). Green denotes correct answers and red denotes incorrect answers produced by audio-visual LLMs. The four model icons stand for our final model, PandaGPT, Video-LLaMA, and ChatBridge, respectively.
  • ...and 9 more figures
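
The judgment tasks described in Figures 1 and 2 pose yes/no questions, so a model's replies can be scored per task with standard binary-classification metrics. The following is a minimal sketch of such scoring, not the authors' evaluation code: the `JudgmentItem` record, the task tags, the first-word reply parsing, and the choice of metrics (accuracy, precision, recall, F1 with "yes" as the positive class, and yes-rate) are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class JudgmentItem:
    # Hypothetical record layout; the released dataset may use different fields.
    task: str        # assumed task tag, e.g. "A->V", "V->A", "AV-Mat"
    answer: str      # ground-truth label: "yes" or "no"
    prediction: str  # raw free-form reply from the model


def normalize(reply: str) -> str:
    """Map a free-form reply to 'yes', 'no', or 'unknown' using its first word."""
    words = reply.strip().lower().split()
    if not words:
        return "unknown"
    first = words[0].strip(".,!:;")
    return first if first in ("yes", "no") else "unknown"


def score(items):
    """Return per-task metrics, treating 'yes' as the positive class."""
    by_task = {}
    for item in items:
        by_task.setdefault(item.task, []).append(item)

    report = {}
    for task, group in by_task.items():
        pairs = [(normalize(it.prediction), it.answer) for it in group]
        tp = sum(p == "yes" and g == "yes" for p, g in pairs)
        fp = sum(p == "yes" and g == "no" for p, g in pairs)
        fn = sum(p != "yes" and g == "yes" for p, g in pairs)
        correct = sum(p == g for p, g in pairs)  # 'unknown' never matches

        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        report[task] = {
            "accuracy": correct / len(pairs),
            "precision": precision,
            "recall": recall,
            "f1": f1,
            # Share of replies parsed as "yes"; on a yes/no-balanced set,
            # a value far from 0.5 suggests a biased answering pattern.
            "yes_rate": sum(p == "yes" for p, _ in pairs) / len(pairs),
        }
    return report


if __name__ == "__main__":
    demo = [
        JudgmentItem("A->V", "yes", "Yes, there is."),
        JudgmentItem("A->V", "no", "Yes."),
        JudgmentItem("A->V", "no", "No."),
        JudgmentItem("A->V", "yes", "No"),
    ]
    for task, metrics in score(demo).items():
        print(task, metrics)
```

Reporting the yes-rate next to accuracy is a design choice for this sketch: because the QnA pairs are evenly split between yes and no answers, a model that always answers "yes" would reach 50% accuracy, and the yes-rate makes that failure mode visible.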