NOAH: Benchmarking Narrative Prior driven Hallucination and Omission in Video Large Language Models

Kyuho Lee, Euntae Kim, Jinwoo Choi, Buru Chang

TL;DR

The paper identifies narrative priors as a key source of hallucination and omission in Video LLMs and proposes NOAH, a benchmark that builds composite videos by inserting clips from other sources into target videos while controlling semantic similarity and insertion position. NOAH comprises one captioning task and three QA tasks (Existence, Temporal, Narrative), yielding more than 60K evaluation samples for examining how coherence bias affects factual grounding. Across open- and closed-source models, the study finds pervasive narrative-prior-driven errors whose patterns vary with architecture, event similarity, and insertion position, and which intensify when fewer frames are sampled. NOAH provides a standardized framework for diagnosing narrative-prior-induced errors, offering a foundation for developing more faithful Video LLMs.

Abstract

Video large language models (Video LLMs) have recently achieved strong performance on tasks such as captioning, summarization, and question answering. Many models and training methods explicitly encourage continuity across events to enhance narrative coherence. While this improves fluency, it also introduces an inductive bias that prioritizes storyline consistency over strict grounding in visual evidence. We identify this bias, which we call narrative prior, as a key driver of two errors: hallucinations, where non-existent events are introduced or existing ones are misinterpreted, and omissions, where factual events are suppressed because they are misaligned with surrounding context. To systematically evaluate narrative prior-induced errors, we introduce NOAH, a large-scale benchmark that constructs composite videos by inserting clips from other sources into target videos. By varying semantic similarity and insertion position, our benchmark enables controlled and scalable analysis of narrative priors. We design one captioning task with tailored metrics and three QA tasks - Existence, Temporal, and Narrative - yielding more than 60K evaluation samples. Extensive experiments yield three key findings: (i) most Video LLMs exhibit hallucinations and omissions driven by narrative priors, (ii) the patterns of these errors vary across architectures and depend on event similarity and insertion position, and (iii) reliance on narrative priors intensifies under sampling with fewer frames, amplifying errors when event continuity is weak. We establish NOAH as the first standardized evaluation of narrative prior-induced hallucination and omission in Video LLMs, providing a foundation for developing more reliable and trustworthy models. Our benchmark and code are available at https://anonymous550520.github.io/.

Paper Structure

This paper contains 24 sections, 3 equations, 8 figures, 6 tables.

Figures (8)

  • Figure 1: Examples of hallucinations and omissions induced by narrative priors, generated by BLIP-3-Video (Ryoo et al., 2024). (a) A hallucinated caption is generated to maintain coherence between distinct events. (b) An event distinct from the others is omitted to preserve narrative continuity.
  • Figure 2: Overview of data construction. Candidate clips are ranked by CLIP cosine similarity, with high-, medium-, and low-similarity clips inserted at the start, middle, or end of each target video, yielding $3 \times 3 = 9$ composite variants per video to study narrative prior–induced errors.
  • Figure 3: Overview of four evaluation tasks. (1) Captioning assesses hallucination and omission in video descriptions; (2) Existence QA tests whether models correctly distinguish real inserted events from distractor events; (3) Temporal QA evaluates understanding of event order; (4) Narrative QA checks whether models reject fabricated but plausible events.
  • Figure 4: Heatmaps of captioning metrics across similarity and insertion position.
  • Figure 5: KL Divergence across insertion positions.
  • ...and 3 more figures
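
The construction summarized in the Figure 2 caption can be illustrated with a minimal sketch. It is not the authors' pipeline. It assumes embeddings are plain lists of floats (the paper uses CLIP embeddings), that a target video is a list of segment labels, and that the high-, medium-, and low-similarity clips are the top, median, and bottom entries of the ranking. The caption only says that candidates are ranked, so that selection rule and all function names here are hypothetical.

```python
import math
from itertools import product

LEVELS = ("high", "medium", "low")
POSITIONS = ("start", "middle", "end")


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / norm


def pick_by_similarity(target_emb, candidates):
    """Rank candidate clips by similarity to the target, most similar first.

    candidates maps a clip name to its embedding. Taking the top, median,
    and bottom of the ranking is an assumption made for this sketch.
    """
    ranked = sorted(
        candidates,
        key=lambda name: cosine(target_emb, candidates[name]),
        reverse=True,
    )
    return {
        "high": ranked[0],
        "medium": ranked[len(ranked) // 2],
        "low": ranked[-1],
    }


def insert_clip(target_segments, clip, position):
    """Return a new segment list with clip inserted at the given position."""
    if position == "start":
        idx = 0
    elif position == "middle":
        idx = len(target_segments) // 2
    elif position == "end":
        idx = len(target_segments)
    else:
        raise ValueError(f"unknown position: {position}")
    return target_segments[:idx] + [clip] + target_segments[idx:]


def composite_variants(target_segments, target_emb, candidates):
    """Build the 3 x 3 = 9 composites keyed by (similarity level, position)."""
    picks = pick_by_similarity(target_emb, candidates)
    return {
        (level, pos): insert_clip(target_segments, picks[level], pos)
        for level, pos in product(LEVELS, POSITIONS)
    }
```

Calling `composite_variants(["t0", "t1", "t2", "t3"], target_emb, candidates)` with at least three candidate clips returns a dictionary with nine entries, one per combination of similarity level and insertion position.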