VidHalluc: Evaluating Temporal Hallucinations in Multimodal Large Language Models for Video Understanding

Chaoyu Li, Eun Woo Im, Pooyan Fazli

TL;DR

VidHalluc introduces the largest benchmark for evaluating temporal hallucinations in multimodal large language models (MLLMs) for video understanding, focusing on action, temporal sequence, and scene transition errors. It pairs semantically similar yet visually distinct videos to probe hallucinations and provides a semi-automatic data collection pipeline with human validation. To mitigate such hallucinations, the authors propose DINO-HEAL, a training-free method that reweights visual features using DINOv2 saliency, improving robustness across multiple backbones with an average gain of 3.02%. The work delivers both a comprehensive evaluation framework and a practical, training-free remedy, with public release of VidHalluc and DINO-HEAL code to facilitate further study and deployment in risk-sensitive video understanding tasks.

Abstract

Multimodal large language models (MLLMs) have recently shown significant advancements in video understanding, excelling in content reasoning and instruction-following tasks. However, hallucination, where models generate inaccurate or misleading content, remains underexplored in the video domain. Building on the observation that MLLM visual encoders often fail to distinguish visually different yet semantically similar video pairs, we introduce VidHalluc, the largest benchmark designed to examine hallucinations in MLLMs for video understanding. It consists of 5,002 videos, paired to highlight cases prone to hallucinations. VidHalluc assesses hallucinations across three critical dimensions: (1) action, (2) temporal sequence, and (3) scene transition. Comprehensive testing shows that most MLLMs are vulnerable to hallucinations across these dimensions. Furthermore, we propose DINO-HEAL, a training-free method that reduces hallucinations by incorporating spatial saliency from DINOv2 to reweight visual features during inference. Our results show that DINO-HEAL consistently improves performance on VidHalluc, achieving an average improvement of 3.02% in mitigating hallucinations across all tasks. Both the VidHalluc benchmark and DINO-HEAL code are available at https://people-robots.github.io/vidhalluc.
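
The reweighting idea behind DINO-HEAL can be illustrated with a minimal sketch. It assumes that one saliency score per spatial patch is already available (for example, derived from DINOv2) alongside the patch features from the visual encoder. The function names, the min-max normalization, the residual form `(1 + alpha * w)`, and the `alpha` parameter are illustrative assumptions, not the authors' implementation.

```python
from typing import List


def normalize_saliency(saliency: List[float]) -> List[float]:
    """Min-max scale raw saliency scores to the range [0, 1]."""
    lo, hi = min(saliency), max(saliency)
    if hi == lo:
        # Uniform saliency expresses no preference, so every weight is 0
        # and the features pass through unchanged.
        return [0.0] * len(saliency)
    return [(s - lo) / (hi - lo) for s in saliency]


def reweight_features(
    features: List[List[float]],
    saliency: List[float],
    alpha: float = 1.0,  # hypothetical strength parameter
) -> List[List[float]]:
    """Scale each patch feature by (1 + alpha * normalized saliency).

    `features` holds one vector per spatial patch; `saliency` holds one
    score per patch. No parameters are learned, so the step is
    training-free and can be applied at inference time.
    """
    if len(features) != len(saliency):
        raise ValueError("need exactly one saliency score per patch")
    weights = normalize_saliency(saliency)
    return [
        [(1.0 + alpha * w) * x for x in patch]
        for patch, w in zip(features, weights)
    ]


# Toy example: three patches with two-dimensional features.
toy_features = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
toy_saliency = [0.0, 0.5, 1.0]
reweighted = reweight_features(toy_features, toy_saliency)
```

In this toy example the least salient patch is left unchanged while the most salient patch is doubled, so salient regions contribute more to whatever the language model consumes downstream.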

Paper Structure

This paper contains 35 sections, 17 equations, 8 figures, and 13 tables.

Figures (8)

  • Figure 1: An example of a video pair showing action hallucination in VidHalluc. Adversarial questions, which refer to actions in the other video of the pair, show that MLLMs are prone to hallucinations when input videos have high semantic but low visual similarity. None of the models accurately identify the prominent actions in this pair.
  • Figure 2: Overview of the VidHalluc benchmark construction process. Candidate video pairs are selected based on high semantic similarity and low visual similarity. GPT-4 is then used to generate action and scene annotations from the captions of each video clip. Human reviewers manually filter out pairs where the GPT-4 annotations are incorrect or where the actions or scenes do not align between clips. Finally, video pairs that pass this filtering are used to automatically generate three types of hallucination questions: action hallucination, temporal sequence hallucination, and scene transition hallucination.
  • Figure 3: Examples of three hallucination types in the VidHalluc benchmark: (1) action hallucination, where the model detects actions in a video that significantly differ from the actual actions; (2) temporal sequence hallucination, where the model fails to represent the correct temporal order of events in a video; and (3) scene transition hallucination, where the model inaccurately describes transitions between distinct scenes within a video.
  • Figure 4: The distribution of video durations in the VidHalluc benchmark.
  • Figure 5: DINO-HEAL pipeline. Since DINOv2 effectively captures salient regions in the input video, we use it to guide the reweighting of attention across spatial regions of the features produced by the visual encoder.
  • ...and 3 more figures
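
The candidate selection step described in the Figure 2 caption (high semantic similarity, low visual similarity) can be sketched as a simple filter over embedding pairs. This is a minimal illustration under stated assumptions: each video is assumed to already have a semantic embedding and a visual embedding, cosine similarity is assumed as the similarity measure, and the thresholds `sem_min` and `vis_max` are hypothetical placeholders rather than values from the paper.

```python
import math
from itertools import combinations
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_candidate_pairs(
    semantic: List[Sequence[float]],
    visual: List[Sequence[float]],
    sem_min: float = 0.9,  # hypothetical threshold, not from the paper
    vis_max: float = 0.6,  # hypothetical threshold, not from the paper
) -> List[Tuple[int, int]]:
    """Return index pairs that are semantically close but visually distinct."""
    if len(semantic) != len(visual):
        raise ValueError("need one semantic and one visual embedding per video")
    pairs = []
    for i, j in combinations(range(len(semantic)), 2):
        semantically_close = cosine(semantic[i], semantic[j]) >= sem_min
        visually_distinct = cosine(visual[i], visual[j]) <= vis_max
        if semantically_close and visually_distinct:
            pairs.append((i, j))
    return pairs


# Toy embeddings: videos 0 and 1 share semantics but look different.
toy_semantic = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0]]
toy_visual = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
candidates = select_candidate_pairs(toy_semantic, toy_visual)
```

Pairs that survive such a filter would still need the annotation and human review stages described in the caption; the filter only proposes candidates.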