NOAH: Benchmarking Narrative Prior driven Hallucination and Omission in Video Large Language Models

Kyuho Lee; Euntae Kim; Jinwoo Choi; Buru Chang

TL;DR

The paper identifies narrative priors as a key source of hallucination and omission in Video LLMs and proposes NOAH, a benchmark that builds composite videos by inserting clips from other sources into target videos while controlling semantic similarity and insertion position. NOAH comprises one captioning task and three QA tasks (Existence, Temporal, and Narrative), yielding more than 60K evaluation samples for analyzing how coherence bias affects factual grounding. Across open- and closed-source models, the study finds pervasive narrative-prior-driven errors whose patterns vary with architecture, event similarity, and insertion position, and which become more pronounced when fewer frames are sampled. NOAH provides a standardized framework for diagnosing narrative-prior distortions, offering a foundation for developing more faithful Video LLMs and for guiding future research toward better grounding.

Abstract

Video large language models (Video LLMs) have recently achieved strong performance on tasks such as captioning, summarization, and question answering. Many models and training methods explicitly encourage continuity across events to enhance narrative coherence. While this improves fluency, it also introduces an inductive bias that prioritizes storyline consistency over strict grounding in visual evidence. We identify this bias, which we call the narrative prior, as a key driver of two errors: hallucinations, where non-existent events are introduced or existing ones are misinterpreted, and omissions, where factual events are suppressed because they are misaligned with the surrounding context. To systematically evaluate narrative-prior-induced errors, we introduce NOAH, a large-scale benchmark that constructs composite videos by inserting clips from other sources into target videos. By varying semantic similarity and insertion position, our benchmark enables controlled and scalable analysis of narrative priors. We design one captioning task with tailored metrics and three QA tasks (Existence, Temporal, and Narrative), yielding more than 60K evaluation samples. Extensive experiments yield three key findings: (i) most Video LLMs exhibit hallucinations and omissions driven by narrative priors, (ii) the patterns of these errors vary across architectures and depend on event similarity and insertion position, and (iii) reliance on narrative priors intensifies when fewer frames are sampled, amplifying errors when event continuity is weak. We establish NOAH as the first standardized evaluation of narrative-prior-induced hallucination and omission in Video LLMs, providing a foundation for developing more reliable and trustworthy models. Our benchmark and code are available at https://anonymous550520.github.io/.
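The composite-video idea described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the authors' pipeline: the `Clip` record, the toy two-dimensional embeddings, the `pick_insert` and `build_composite` helpers, and the use of cosine similarity against the target's mean embedding as the similarity control are all hypothetical.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass(frozen=True)
class Clip:
    source: str        # identifier of the video the clip came from (hypothetical field)
    caption: str       # short description of the event shown in the clip
    embedding: tuple   # toy semantic embedding; a real system would use a learned encoder


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def pick_insert(target, candidates, want_high_similarity):
    """Pick the candidate most (or least) similar to the target's mean embedding."""
    dims = len(target[0].embedding)
    mean = tuple(sum(c.embedding[i] for c in target) / len(target) for i in range(dims))

    def score(clip):
        return cosine(clip.embedding, mean)

    return max(candidates, key=score) if want_high_similarity else min(candidates, key=score)


def build_composite(target, inserted, position):
    """Return a new clip list with `inserted` placed at index `position`."""
    if not 0 <= position <= len(target):
        raise ValueError("position out of range")
    return target[:position] + [inserted] + target[position:]


# Toy data (invented for this sketch): a target video with three events.
target = [
    Clip("cooking", "chop onions", (1.0, 0.0)),
    Clip("cooking", "fry onions", (0.9, 0.1)),
    Clip("cooking", "serve dish", (0.8, 0.2)),
]
candidates = [
    Clip("baking", "knead dough", (0.9, 0.2)),
    Clip("surfing", "ride a wave", (0.0, 1.0)),
]

# Low-similarity insertion in the middle of the target video.
low = pick_insert(target, candidates, want_high_similarity=False)
composite = build_composite(target, low, position=1)
print([c.caption for c in composite])
```

In this toy setting, a model driven by a narrative prior might omit the inserted surfing event from its caption or reinterpret it as part of the cooking storyline; varying `want_high_similarity` and `position` corresponds to the two controlled factors the abstract names.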
