STEAR: Layer-Aware Spatiotemporal Evidence Intervention for Hallucination Mitigation in Video Large Language Models

Linfeng Fan, Yuan Tian, Ziwei Li, Zhiwu Lu

Abstract

Video Large Language Models (Video-LLMs) remain prone to spatiotemporal hallucinations, often generating visually unsupported details or incorrect temporal relations. Existing mitigation methods typically treat hallucination as a uniform decoding failure and apply globally shared correction rules. We instead observe that decoder layers play distinct roles, with middle layers consolidating visual grounding and late layers handling linguistic composition, indicating that intervention must be layer-aware. Based on this insight, we propose STEAR, a layer-aware spatiotemporal evidence intervention framework. STEAR identifies high-risk decoding steps and selects token-conditioned visual evidence from grounding-sensitive middle layers. This shared evidence serves two coupled purposes: restoring missing local grounding in the middle layers, and constructing temporally perturbed patch-level counterfactuals that falsify inconsistent reasoning during late-layer decoding. Consequently, STEAR mitigates both spatial and temporal hallucinations within an efficient single-encode inference framework. Experiments across representative Video-LLM backbones and challenging benchmarks demonstrate that STEAR consistently reduces hallucinations while improving faithfulness, temporal consistency, and robustness. Our results confirm that reliable video decoding depends on intervening with precise evidence at the right layers, rather than enforcing a global penalty. The code is provided in the Supplementary Material.
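
As a concrete illustration of the first step, the minimal PyTorch sketch below flags high-risk decoding steps. The abstract does not specify the uncertainty measure, so predictive entropy of the next-token distribution is assumed here; the function name `risky_steps` and the threshold `tau` are hypothetical.

```python
import torch
import torch.nn.functional as F

def risky_steps(logits: torch.Tensor, tau: float = 2.5) -> torch.Tensor:
    """Flag decoding steps whose next-token distribution is high-entropy.

    logits: (T, V) pre-softmax scores over a V-token vocabulary at T steps.
    tau:    entropy threshold in nats (hypothetical value; tune per backbone).
    Returns a boolean mask of shape (T,) marking steps to intervene on.
    """
    log_p = F.log_softmax(logits, dim=-1)
    entropy = -(log_p.exp() * log_p).sum(dim=-1)  # Shannon entropy per step
    return entropy > tau
```

Stable (low-entropy) steps are left untouched, which is consistent with the paper's goal of intervening only where decoding is at risk.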

Paper Structure

This paper contains 20 sections, 17 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Given a single video encoding, STEAR first diagnoses risky decoding steps via token uncertainty, then selects token-conditioned key evidence shared by both middle-layer reinjection and deep-layer counterfactual decoding. This unified design addresses spatial hallucination by restoring missing grounding and suppresses temporal hallucination by falsifying temporally corrupted key evidence (a hedged sketch of this counterfactual step follows after this list).
  • Figure 2: (a) Grounding score peaks in the middle decoder layers, while language dominance drops in the middle and rises again in the late layers, indicating a transition from evidence consolidation to language-dominant decoding. This suggests that hallucinations arise when token-relevant visual evidence is not sufficiently grounded before the model enters the late reasoning regime. (b) Reinjection into the middle layers yields the largest gain, whereas reinjection at other layers is ineffective.
  • Figure 3: For each risky token, KES aggregates middle-layer cross-attention to identify the top-$k$ key patches that best support the current prediction. The selected evidence is then reinjected into grounding-sensitive middle layers, restoring token-required but potentially neglected visual support without disturbing stable decoding steps (see the evidence-selection sketch after this list).
  • Figure 4: Compared with the underlying Video-LLMs, STEAR better grounds local evidence and suppresses temporally inconsistent generations.
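
The key evidence selection of Figure 3 can be sketched as follows. The caption only states that middle-layer cross-attention is aggregated per risky token, so mean pooling over layers and heads is an assumption; `select_key_evidence` and the default `k` are hypothetical.

```python
import torch

def select_key_evidence(cross_attn: torch.Tensor, k: int = 16) -> torch.Tensor:
    """Aggregate middle-layer cross-attention and pick the top-k key patches.

    cross_attn: (L_mid, H, P) attention weights from the current text token
                to P visual patches, stacked over L_mid middle layers and H heads.
    k:          number of key patches to keep (hypothetical default).
    Returns indices of the k patches that best support the current prediction.
    """
    # Mean pooling over layers and heads is an assumption; the paper only
    # says that middle-layer cross-attention is aggregated per risky token.
    patch_scores = cross_attn.mean(dim=(0, 1))  # (P,) per-patch support score
    return patch_scores.topk(k).indices
```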
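
Figure 1's falsification of temporally corrupted key evidence might look like the sketch below, assuming a frame-order shuffle restricted to the selected key patches and a standard contrastive-decoding combination of original and counterfactual logits; both function names and the contrast weight `alpha` are hypothetical.

```python
import torch

def temporally_perturb(patches: torch.Tensor, key_idx: torch.Tensor) -> torch.Tensor:
    """Frame-shuffle only the selected key patches to corrupt temporal order.

    patches: (F, P, D) per-frame patch features (F frames, P patches, dim D).
    key_idx: indices of key patches from the shared evidence selection.
    """
    perturbed = patches.clone()
    perm = torch.randperm(patches.size(0))  # random permutation of frame order
    # Only the key evidence is corrupted; the remaining patches stay intact.
    perturbed[:, key_idx] = patches[perm][:, key_idx]
    return perturbed

def contrast_logits(orig: torch.Tensor, counterfactual: torch.Tensor,
                    alpha: float = 0.5) -> torch.Tensor:
    """Penalize continuations that survive temporal corruption.

    A contrastive-decoding style combination (an assumed fusion rule):
    tokens supported only by genuine temporal order gain probability mass.
    """
    return (1 + alpha) * orig - alpha * counterfactual
```

Intuitively, a token whose likelihood is unchanged under frame shuffling is not grounded in the video's temporal structure, so the contrast suppresses it during late-layer decoding.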