Exploring Hallucination of Large Multimodal Models in Video Understanding: Benchmark, Analysis and Mitigation

Hongcheng Gao, Jiashu Qu, Jingyi Tang, Baolong Bi, Yue Liu, Hongyu Chen, Li Liang, Li Su, Qingming Huang

TL;DR

This work addresses hallucination in large multimodal models (LMMs) applied to video understanding. It introduces HAVEN, a benchmark built around three causes, three aspects, and three question formats, which comprises 6,497 questions and is used to evaluate 16 models. The analysis examines how factors such as video duration, frame sampling, and model size influence hallucination and consistency, and finds that in-context conflicts and question length are particularly detrimental. To mitigate hallucinations, the authors propose a video-thinking paradigm that combines supervised reasoning fine-tuning (SRFT) with segment-weighted direct preference optimization (TDPO), which yields a 7.65% accuracy improvement on hallucination evaluation and a 4.5% reduction in the bias score for consistency, exemplified by LLaVA-NeXT-Video-DPO-7B and related variants. The code and data are released to support robust video-language understanding in LMMs.

Abstract

The hallucination of large multimodal models (LMMs), that is, providing responses that appear correct but are actually incorrect, limits their reliability and applicability. This paper studies the hallucination problem of LMMs in the video modality, which is dynamic and more challenging than static modalities such as images and text. Motivated by this, we first present a comprehensive benchmark, termed HAVEN, for evaluating hallucinations of LMMs in video understanding tasks. It is built upon three dimensions, i.e., hallucination causes, hallucination aspects, and question formats, resulting in 6K questions. Then, we quantitatively study 7 influential factors on hallucinations, e.g., video duration, model size, and model reasoning, via experiments on 16 LMMs using the presented benchmark. In addition, inspired by recent thinking models like OpenAI o1, we propose a video-thinking model to mitigate the hallucinations of LMMs via supervised reasoning fine-tuning (SRFT) and direct preference optimization (TDPO), where SRFT enhances reasoning capabilities while TDPO reduces hallucinations in the thinking process. Extensive experiments and analyses demonstrate its effectiveness. Remarkably, it improves the baseline by 7.65% in accuracy on hallucination evaluation and reduces the bias score by 4.5%. The code and data are public at https://github.com/Hongcheng-Gao/HAVEN.
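To make the three-dimension organization concrete, the sketch below tallies accuracy separately along each dimension. It is a minimal illustration and not the authors' evaluation code: the record layout, the `accuracy_by` helper, and the placeholder labels `cause_a`, `aspect_a`, and so on are all assumptions introduced here. Only the format labels T/F, MC, and SA come from the paper's figure captions.

```python
from collections import defaultdict


def accuracy_by(records, key):
    """Return {category: accuracy} for one dimension (hypothetical helper)."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for record in records:
        totals[record[key]] += 1
        hits[record[key]] += int(record["correct"])
    return {category: hits[category] / totals[category] for category in totals}


# Toy records with placeholder cause/aspect labels; these are NOT benchmark data.
records = [
    {"cause": "cause_a", "aspect": "aspect_a", "format": "T/F", "correct": True},
    {"cause": "cause_a", "aspect": "aspect_b", "format": "MC", "correct": False},
    {"cause": "cause_b", "aspect": "aspect_a", "format": "SA", "correct": True},
    {"cause": "cause_b", "aspect": "aspect_b", "format": "T/F", "correct": True},
]

for dimension in ("cause", "aspect", "format"):
    print(dimension, accuracy_by(records, dimension))
```

Scoring each dimension independently, rather than only reporting one overall accuracy, is what allows a breakdown such as "in-context conflicts are particularly detrimental" to be read off the results.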

Paper Structure

This paper contains 45 sections, 2 equations, 11 figures, 12 tables.

Figures (11)

  • Figure 1: Construction protocol of HAVEN. The left section outlines the three dimensions of data construction and the associated categories within each, while the right section details the evaluation process and metrics.
  • Figure 2: Distribution of duration time, frame count, and question length.
  • Figure 3: Question format distribution. Percentage share of each format: binary-choice (T/F), multiple-choice (MC), and short-answer (SA), along with the proportion occupied by the detailed answer.
  • Figure 4: The impact of video duration, frame count, and question length on LMM hallucination.
  • Figure 5: Accuracy heatmap along two dimensions.
  • ...and 6 more figures