SEASON: Mitigating Temporal Hallucination in Video Large Language Models via Self-Diagnostic Contrastive Decoding
Chang-Hsun Wu, Kai-Po Chang, Yu-Yang Sheng, Hung-Kai Chung, Kuei-Chun Wang, Yu-Chiang Frank Wang
TL;DR
This work targets hallucinations in VideoLLMs, with temporal misalignment as a core challenge. It introduces SEASON, a training-free framework that combines two components: temporal homogenization, which creates temporally hard negatives, and a self-diagnostic mechanism, which assigns token-level penalties. Together they enable adaptive contrastive decoding against temporal and spatial priors. By measuring frame-level attention divergence between the original and negative video representations, SEASON generates per-token weights that guide logit-space contrastive decoding, improving temporal faithfulness without retraining. Experiments across multiple backbones and benchmarks show state-of-the-art hallucination reduction while preserving general video understanding, which makes the method practical for building more reliable video-language systems.
Abstract
Video Large Language Models (VideoLLMs) have shown remarkable progress in video understanding. However, these models still struggle to effectively perceive and exploit the rich temporal information in videos when responding to user queries. As a result, they often describe events in ways that are temporally inconsistent or causally implausible, causing severe hallucination issues. While most prior studies have focused on spatial hallucinations (e.g., object mismatches), temporal reasoning in video understanding remains relatively underexplored. To address this issue, we propose Self-Diagnostic Contrastive Decoding (SEASON), a training-free method that adaptively enhances temporal and spatial faithfulness for each output token. It achieves this by dynamically diagnosing each token's hallucination tendency and applying adaptive contrastive decoding against its corresponding temporal and spatial negatives. Extensive experiments demonstrate that SEASON outperforms all existing training-free hallucination mitigation approaches on three hallucination examination benchmarks, and further improves VideoLLMs across four general video understanding benchmarks. The code will be released upon acceptance.
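
To make the idea of adaptive contrastive decoding concrete, the following is a minimal sketch and not the authors' implementation. It assumes a generic contrastive form in which the logits from each negative input are subtracted from the original logits, scaled by a per-token weight. The function name `contrastive_logits`, the weights `w_t` and `w_s`, and the toy logit values are all hypothetical; how SEASON actually derives its weights from attention divergence is not specified in this abstract, so the weights are simply passed in.

```python
def contrastive_logits(orig, temporal_neg, spatial_neg, w_t, w_s):
    """Adjust next-token logits against two negative branches.

    Assumed generic form (not taken from the paper):
        adjusted = orig + w_t * (orig - temporal_neg) + w_s * (orig - spatial_neg)

    orig          logits conditioned on the original video
    temporal_neg  logits conditioned on a temporally degraded video
    spatial_neg   logits conditioned on a spatially degraded video
    w_t, w_s      hypothetical per-token penalty weights for this step
    """
    return [
        o + w_t * (o - t) + w_s * (o - s)
        for o, t, s in zip(orig, temporal_neg, spatial_neg)
    ]


def argmax(values):
    """Index of the largest value (greedy token choice)."""
    return max(range(len(values)), key=lambda i: values[i])


# Toy vocabulary of three tokens. Token 0 scores even higher on the
# temporal negative, suggesting it is driven by a prior rather than by
# the actual frame order.
orig = [2.0, 1.0, 0.5]
temporal_neg = [3.0, 0.5, 0.5]
spatial_neg = [2.0, 1.0, 0.0]

adjusted = contrastive_logits(orig, temporal_neg, spatial_neg, w_t=1.0, w_s=0.5)
print(argmax(orig), argmax(adjusted))
```

In this toy case, greedy decoding moves from token 0 to token 1 once the temporal negative is penalized. Setting both weights to zero recovers ordinary decoding, which is why per-token weights allow the correction to be applied only where a token appears prone to hallucination.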
