INFACT: A Diagnostic Benchmark for Induced Faithfulness and Factuality Hallucinations in Video-LLMs

Junqi Yang, Yuecong Min, Jie Zhang, Shiguang Shan, Xilin Chen

Abstract

Despite rapid progress, Video Large Language Models (Video-LLMs) remain unreliable due to hallucinations, which are outputs that contradict either video evidence (faithfulness) or verifiable world knowledge (factuality). Existing benchmarks provide limited coverage of factuality hallucinations and predominantly evaluate models only in clean settings. We introduce INFACT, a diagnostic benchmark comprising 9,800 QA instances with fine-grained taxonomies for faithfulness and factuality, spanning real and synthetic videos. INFACT evaluates models in four modes: Base (clean), Visual Degradation, Evidence Corruption, and Temporal Intervention for order-sensitive items. Reliability under induced modes is quantified using Resist Rate (RR) and Temporal Sensitivity Score (TSS). Experiments on 14 representative Video-LLMs reveal that higher Base-mode accuracy does not reliably translate to higher reliability in the induced modes, with evidence corruption reducing stability and temporal intervention yielding the largest degradation. Notably, many open-source baselines exhibit near-zero TSS on factuality, indicating pronounced temporal inertia on order-sensitive questions.

Paper Structure (42 sections, 2 equations, 6 figures, 9 tables)

Figures (6)

  • Figure 1: Examples of Faithfulness and Factuality Hallucinations. Top: Faithfulness items are verified by video evidence, covering Static Entities & Attributes, Dynamic Actions & Motions, and Spatio-Temporal Relations. Bottom: Factuality items require consistency with world knowledge and cover Domain Knowledge (Know-WHAT), Procedural Knowledge (Know-HOW), and Physical Knowledge (Know-WHY).
  • Figure 2: Overview of the INFACT construction process. Left: Candidate videos and QA pairs are collected from multiple sources, including video QA datasets, instructional datasets, and synthetic videos. Middle: Samples are organized into fine-grained faithfulness and factuality dimensions, and filtered to remove ambiguous or non-video-grounded items, followed by human-in-the-loop quality verification. Right: The resulting benchmark supports four evaluation modes: Base, Visual Degradation, Evidence Corruption, and Temporal Intervention.
  • Figure 3: Dataset composition of INFACT. Distribution over the fine-grained taxonomy for faithfulness (top) and factuality (bottom).
  • Figure 4: Base accuracy vs. average reliability score under inductions. Base accuracy is measured in Mode I and averaged over faithfulness and factuality. The average reliability score under induction aggregates RR over Modes II and III and TSS over Mode IV.
  • Figure 5: Comparison of four representative models on fine-grained evaluation dimensions. The left radar plot shows performance on faithfulness dimensions, while the right radar plot shows performance on factuality dimensions. Each axis corresponds to a fine-grained category in the INFACT taxonomy, and each curve represents one representative model. Higher values indicate better performance on the corresponding dimension.
  • ...and 1 more figures