VidHal: Benchmarking Temporal Hallucinations in Vision LLMs

Wey Yeh Choong, Yangyang Guo, Mohan Kankanhalli

TL;DR

VidHal tackles video-based hallucinations in Vision LLMs by introducing a dedicated benchmark that spans diverse temporal aspects, together with a caption-ordering task for fine-grained evaluation. The dataset comprises 1,000 videos, each paired with an anchor caption and M-1 hallucinatory captions generated via GPT-4o and validated for reliability. The evaluation combines MCQA with an NDCG-based caption ranking to capture both coarse and nuanced hallucinations, including transitivity and image-prior effects. Across thirteen VLLMs, VidHal reveals substantial gaps in temporal understanding: larger and proprietary models perform better, and the ordering tasks prove more challenging than MCQA. These findings point to directions for improved temporal reasoning and hallucination mitigation.

Abstract

Vision Large Language Models (VLLMs) are widely acknowledged to be prone to hallucinations. Existing research addressing this problem has primarily been confined to image inputs, with limited exploration of video-based hallucinations. Furthermore, current evaluation methods fail to capture nuanced errors in generated responses, which are often exacerbated by the rich spatiotemporal dynamics of videos. To address this, we introduce VidHal, a benchmark specially designed to evaluate video-based hallucinations in VLLMs. VidHal is constructed by bootstrapping video instances across a wide range of common temporal aspects. A defining feature of our benchmark lies in the careful creation of captions which represent varying levels of hallucination associated with each video. To enable fine-grained evaluation, we propose a novel caption ordering task requiring VLLMs to rank captions by hallucinatory extent. We conduct extensive experiments on VidHal and comprehensively evaluate a broad selection of models. Our results uncover significant limitations in existing VLLMs regarding hallucination generation. Through our benchmark, we aim to inspire further research on 1) holistic understanding of VLLM capabilities, particularly regarding hallucination, and 2) extensive development of advanced VLLMs to alleviate this problem.
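To make the NDCG-based caption ranking mentioned in the TL;DR more concrete, the following is a minimal sketch of how a predicted caption ordering could be scored against the ground-truth ordering. It uses the textbook NDCG definition with a logarithmic position discount. The relevance assignment (M minus the ground-truth rank) and the function names `dcg` and `ndcg_for_ordering` are assumptions made for illustration; the paper's exact relevance values and normalization are not given in this summary and may differ.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain with the standard log2 position discount."""
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances))


def ndcg_for_ordering(predicted_order, true_order):
    """Score a predicted caption ordering against the ground-truth ordering.

    Both arguments list the same caption ids, least hallucinatory first.
    Assumption: a caption at ground-truth rank r (0-based) gets relevance M - r.
    """
    m = len(true_order)
    relevance = {cap: m - rank for rank, cap in enumerate(true_order)}
    predicted = dcg([relevance[cap] for cap in predicted_order])
    ideal = dcg([relevance[cap] for cap in true_order])
    return predicted / ideal


if __name__ == "__main__":
    truth = ["anchor", "mild", "severe"]
    print(ndcg_for_ordering(["anchor", "mild", "severe"], truth))
    print(ndcg_for_ordering(["anchor", "severe", "mild"], truth))
    print(ndcg_for_ordering(["severe", "mild", "anchor"], truth))
```

Under this sketch a perfect ordering scores 1, and swapping only the two hallucinatory captions is penalized less than reversing the whole list, which is the kind of graded signal that a single-answer MCQA accuracy cannot provide.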

Paper Structure

This paper contains 46 sections, 7 equations, 37 figures, and 4 tables.

Figures (37)

  • Figure 1: Multiple-Choice Question Answering (MCQA) performance of representative VLLMs on our VidHal benchmark. (Left) Overall ranking of VLLMs. (Right) Detailed accuracy results pertaining to each temporal aspect, wherein higher scores indicate fewer hallucinations.
  • Figure 2: Overview of our VidHal benchmark construction pipeline. Using direction as an example from the five selected aspects, we begin by sourcing relevant video instances from existing datasets. Next, the anchor (positive) caption is generated from the original video metadata. Finally, GPT-4o is employed to generate hallucinatory captions at varying levels.
  • Figure 3: Human agreement on hallucination levels in the VidHal dataset. (Left) Distribution of agreement ratios per video sample. (Right) Average agreement ratio for each temporal aspect, with an overall average of 87%.
  • Figure 4: Visual illustration of the relative caption ordering task in VidHal. The final ordering is parsed based on VLLM responses for each pair order queried.
  • Figure 5: Aspect-aware results of VLLMs for the (Left) naive and (Right) relative caption ordering task. The dotted lines represent the average NDCG scores across all models.
  • ...and 32 more figures
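
The caption for Figure 4 says the final ordering in the relative caption ordering task is parsed from the model's responses to each queried caption pair. The summary does not specify the parsing rule, so the sketch below shows one hypothetical way to aggregate pairwise judgments: count how often each caption is preferred and sort by that count. The names `order_from_pairwise`, `is_transitive`, and the `prefers` callback are illustrative assumptions, not the authors' implementation.

```python
from itertools import combinations


def pairwise_wins(captions, prefers):
    """Count, for each caption, how many pairwise comparisons it wins.

    `prefers(a, b)` is a hypothetical callback that returns whichever of the
    two captions the model judges to be less hallucinatory.
    """
    wins = {cap: 0 for cap in captions}
    for a, b in combinations(captions, 2):
        wins[prefers(a, b)] += 1
    return wins


def order_from_pairwise(captions, prefers):
    """Order captions from least to most hallucinatory by win count.

    Ties keep their input order because Python's sort is stable.
    """
    wins = pairwise_wins(captions, prefers)
    return sorted(captions, key=lambda cap: wins[cap], reverse=True)


def is_transitive(captions, prefers):
    """Check whether the pairwise judgments form a consistent total order.

    With every pair compared exactly once, the judgments are transitive
    exactly when the win counts are 0, 1, ..., M-1 in some order.
    """
    wins = pairwise_wins(captions, prefers)
    return sorted(wins.values()) == list(range(len(captions)))


if __name__ == "__main__":
    true_rank = {"anchor": 0, "mild": 1, "severe": 2}
    consistent = lambda a, b: a if true_rank[a] < true_rank[b] else b
    print(order_from_pairwise(["severe", "anchor", "mild"], consistent))
    print(is_transitive(["severe", "anchor", "mild"], consistent))
```

The transitivity check is included because inconsistent pairwise answers (a cycle such as A over B, B over C, C over A) cannot be turned into a single ordering without a tie-breaking choice.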