SEASON: Mitigating Temporal Hallucination in Video Large Language Models via Self-Diagnostic Contrastive Decoding

Chang-Hsun Wu, Kai-Po Chang, Yu-Yang Sheng, Hung-Kai Chung, Kuei-Chun Wang, Yu-Chiang Frank Wang

TL;DR

This work targets hallucinations in VideoLLMs, with temporal misalignment as a core challenge. It introduces SEASON, a training-free framework that combines temporal homogenization to create temporally hard negatives and a self-diagnostic mechanism to assign token-level penalties, enabling adaptive contrastive decoding against temporal and spatial priors. By measuring frame-level attention divergence across original and negative video representations, SEASON generates per-token weights that guide logit-space contrastive decoding, achieving temporal faithfulness without retraining. Experiments across multiple backbones and benchmarks show state-of-the-art performance on hallucination reduction while preserving general video understanding, demonstrating strong practical impact for reliable video-language systems.

Abstract

Video Large Language Models (VideoLLMs) have shown remarkable progress in video understanding. However, these models still struggle to effectively perceive and exploit rich temporal information in videos when responding to user queries. As a result, they often generate descriptions of events that are temporally inconsistent or causally implausible, causing severe hallucination issues. While most prior studies have focused on spatial hallucinations (e.g., object mismatches), temporal reasoning in video understanding remains relatively underexplored. To address this issue, we propose Self-Diagnostic Contrastive Decoding (SEASON), a training-free method that adaptively enhances temporal and spatial faithfulness for each output token. It achieves this by dynamically diagnosing each token's hallucination tendency and applying adaptive contrastive decoding against its corresponding temporal and spatial negatives. Extensive experiments demonstrate that SEASON outperforms all existing training-free hallucination mitigation approaches on three hallucination examination benchmarks, while further improving VideoLLMs across four general video understanding benchmarks. The code will be released upon acceptance.
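
To make the idea of token-level contrastive decoding against two negatives concrete, here is a minimal sketch in plain Python. It is not the paper's formulation: the combination rule `(1 + w_s + w_t) * orig - w_s * spatial - w_t * temporal` is the common contrastive-decoding form extended to two negatives, and the names `contrast`, `pick_token`, `w_s`, and `w_t` are hypothetical. The logits are toy values for a three-token vocabulary.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_norm for x in logits]


def contrast(orig, spatial_neg, temporal_neg, w_s, w_t):
    """Combine next-token logits from the original and the two negative inputs.

    ASSUMPTION: the weighting rule below is a generic contrastive-decoding
    form, not necessarily the one used by SEASON. w_s and w_t are the
    per-token spatial and temporal penalties.
    """
    lo = log_softmax(orig)
    ls = log_softmax(spatial_neg)
    lt = log_softmax(temporal_neg)
    return [
        (1.0 + w_s + w_t) * o - w_s * s - w_t * t
        for o, s, t in zip(lo, ls, lt)
    ]


def pick_token(scores):
    """Greedy choice: index of the highest score."""
    return max(range(len(scores)), key=scores.__getitem__)


# Toy logits. Token 0 is preferred even when temporal order is blurred,
# which suggests it is driven by a prior rather than by temporal evidence.
orig_logits = [2.0, 1.9, 0.0]
temporal_neg_logits = [3.0, 0.5, 0.0]
spatial_neg_logits = orig_logits  # no spatial disagreement in this toy case

plain = pick_token(contrast(orig_logits, spatial_neg_logits, temporal_neg_logits, 0.0, 0.0))
penalized = pick_token(contrast(orig_logits, spatial_neg_logits, temporal_neg_logits, 0.0, 1.0))
```

With both weights at zero the scores reduce to ordinary greedy decoding. Raising `w_t` penalizes tokens that the temporally ambiguous input also favors, so the choice shifts toward the token that depends on the original frame order. Practical contrastive decoders usually also restrict candidates to tokens that are plausible under the original distribution; that step is omitted here.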

Paper Structure

This paper contains 29 sections, 11 equations, 15 figures, 8 tables.

Figures (15)

  • Figure 1: Suppressing hallucination in video LLMs. (A) DINO-HEAL [vidhalluc] exploits spatial saliency but misses temporal order, (B) TCD [eventhallusion] contrasts frame-dropped videos but ignores causal relations, and (C) our SEASON achieves temporal faithfulness for each output token.
  • Figure 2: Overview of SEASON. Given the input video ($V$) and the question ($Q$), our proposed SEASON contrasts the original video representations ($v^O$) against our introduced spatial ($v^S$) and temporal ($v^T$) negatives to jointly achieve temporal and spatial faithfulness. Specifically, we design $v^T$ via the proposed "Temporal Homogenization", focusing on introducing temporal ambiguity while preserving spatial semantics. The "Self-Diagnostic Mechanism" computes token-level adaptive weights ($W^S, W^T$) by measuring attention divergence, dynamically steering the final decoding to penalize spatial or temporal hallucinations.
  • Figure 3: Illustration of Temporal Homogenization. This constructs the temporal negative $v^T$ by computing a layer-wise average of frame features ($d_1,...,d_l$) and progressively re-injecting this global context back into each frame's representation within the vision encoder. The resulting representation is temporally ambiguous while preserving per-frame structural information.
  • Figure 4: Illustration of the Self-Diagnostic Mechanism. This process extracts the frame-level attention distribution ($\mathcal{A}_\textit{frame}$) from the preceding token. It computes the Jensen-Shannon divergence (JSD) between the attention distributions of the original video ($v^O$) and the negatives ($v^S$, $v^T$), outputting the adaptive spatial ($W^S$) and temporal ($W^T$) diagnostic weights to penalize spatial or temporal hallucination for each output token.
  • Figure 5: Qualitative visualization of SEASON's self-diagnostic mechanism. The plots show the self-diagnostic weights ($W^T$ and $W^S$). In the generated text (the x-axis of the line plot), blue tokens are identified as relying on visual temporal cues; SEASON thus contrasts them against the temporal negative ($v^T$) to ensure token-level temporal faithfulness. For instance, tokens critical for temporal ordering, such as "B" in (a) and "A" and "first" in (b), clearly receive high temporal weights ($W^T$) to ensure the sequence is correct. On the other hand, orange tokens rely on visual spatial cues and are contrasted against the spatial negative ($v^S$). This is evident as tokens describing objects and interactions, such as "placing butter...mixing bowl" in (a) and "hand...swirl batter" in (b), are assigned high spatial weights ($W^S$). Both (a) and (b) are samples from VidHalluc [vidhalluc].
  • ...and 10 more figures
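
The captions for Figures 3 and 4 describe two mechanisms that a short sketch can illustrate. It is written in plain Python and rests on several assumptions that the captions do not confirm. `homogenize_layer` and its `mix` coefficient are a hypothetical stand-in for "re-injecting" the cross-frame average. `diagnostic_weights` assumes that a larger divergence maps to a larger penalty weight and that the weight is the JSD scaled by its maximum, ln 2. All function names are illustrative.

```python
import math


def homogenize_layer(frames, mix):
    """Blend each frame's feature vector with the cross-frame mean.

    ASSUMPTION: linear blending with coefficient `mix` stands in for the
    re-injection step; the real operation inside the vision encoder may differ.
    """
    count, dim = len(frames), len(frames[0])
    mean = [sum(f[j] for f in frames) / count for j in range(dim)]
    return [[(1.0 - mix) * f[j] + mix * mean[j] for j in range(dim)] for f in frames]


def kl(p, q):
    """KL divergence in nats, skipping zero-probability entries of p."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def jsd(p, q):
    """Jensen-Shannon divergence in nats; bounded by ln 2."""
    mid = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, mid) + 0.5 * kl(q, mid)


def normalize(mass):
    """Turn non-negative per-frame attention mass into a distribution."""
    total = sum(mass)
    return [m / total for m in mass]


def diagnostic_weights(attn_orig, attn_spatial, attn_temporal):
    """Return (w_s, w_t) in [0, 1] from per-frame attention mass.

    ASSUMPTION: weight = JSD / ln 2; the actual mapping is not given here.
    """
    p = normalize(attn_orig)
    w_s = jsd(p, normalize(attn_spatial)) / math.log(2.0)
    w_t = jsd(p, normalize(attn_temporal)) / math.log(2.0)
    return w_s, w_t


# Two frames with 2-d features; repeated blending pulls frames toward the mean.
frames = [[0.0, 1.0], [2.0, 1.0]]
once = homogenize_layer(frames, mix=0.5)
twice = homogenize_layer(once, mix=0.5)

# Attention over three frames: unchanged under the spatial negative,
# flattened under the temporal negative.
w_s, w_t = diagnostic_weights([0.1, 0.7, 0.2], [0.1, 0.7, 0.2], [0.34, 0.33, 0.33])
```

In this toy case `w_s` is zero because the spatial negative leaves the attention pattern unchanged, while `w_t` is positive because the temporal negative flattens it. Under the stated assumption, such a token would be contrasted mainly against the temporal negative.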