Alternating Perception-Reasoning for Hallucination-Resistant Video Understanding
Bowei Pu, Chuanbin Liu, Yifan Ge, Peicheng Zhou, Yiwei Sun, Zhiying Lu, Jiankang Wang, Hongtao Xie
TL;DR
This work introduces Perception Loop Reasoning (PLR), a looped, timestamped perception framework for video understanding that mitigates perception shortcuts and hallucinations. Central to PLR is the Factual-Aware Evaluator (FAE), which is trained on AnetHallu-117K and supplies an anti-hallucination reward that enforces grounding in video evidence during RL with GRPO. The authors construct VideoPLR-14K for cold-start training and report state-of-the-art performance at the 3B and 7B scales with strong data efficiency, supported by extensive ablations and analyses of bias and reward design. The approach yields grounded, temporally aware reasoning, and the code and datasets are released for the research community.
Abstract
Sufficient visual perception is the foundation of video reasoning. Nevertheless, existing Video Reasoning LLMs suffer from perception shortcuts because they rely on a flawed single-step perception paradigm. This paradigm describes the video once and then reasons over that description, which risks insufficient evidence and emergent hallucinations. To address these issues, we introduce a new framework that integrates a loop-based paradigm with an anti-hallucination reward. First, to address insufficient evidence, we introduce the Perception Loop Reasoning (PLR) paradigm. Instead of describing the whole video at once, each loop requires the model to describe a video segment with precise timestamps, analyze this segment, and decide the next action. Second, to reduce the risk of hallucinations, the Factual-Aware Evaluator (FAE) scores each perception result, providing a reliable anti-hallucination reward. This reward encourages the model to provide sufficient and precise video evidence. Our FAE, which performs comparably to GPT-4o, is tuned on AnetHallu-117K, our large-scale hallucination judgment preference dataset. Extensive experiments show that our Video-PLR achieves state-of-the-art results at both the 3B and 7B parameter scales and has the best data efficiency. Our code, models, and datasets are released at: https://github.com/BoweiPu/VideoPLR.
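As a rough illustration of the loop described above, the following Python sketch shows one way the describe, analyze, and decide cycle and a per-step factuality reward could be organized. It is not the released implementation: the names `PerceptionStep`, `perception_loop`, `anti_hallucination_reward`, the `"continue"`/`"answer"` action strings, the `max_loops` cap, and the choice to average per-step scores are all assumptions made for illustration. The toy model and toy judge stand in for the video LLM and the FAE.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class PerceptionStep:
    """One loop iteration: a timestamped description plus its analysis."""
    start: float        # segment start in seconds
    end: float          # segment end in seconds
    description: str    # what the model claims to see in the segment
    analysis: str       # reasoning over that segment


# Hypothetical interfaces (assumptions, not the paper's API).
StepFn = Callable[[str, List[PerceptionStep]], Tuple[PerceptionStep, str]]
JudgeFn = Callable[[PerceptionStep], float]


def perception_loop(question: str, step_fn: StepFn,
                    max_loops: int = 8) -> List[PerceptionStep]:
    """Repeat describe -> analyze -> decide until the model chooses to answer."""
    history: List[PerceptionStep] = []
    for _ in range(max_loops):
        step, action = step_fn(question, history)
        history.append(step)
        if action == "answer":  # the model decides it has enough evidence
            break
    return history


def anti_hallucination_reward(history: List[PerceptionStep],
                              judge: JudgeFn) -> float:
    """Average the judge's factuality score over all perception steps.

    Averaging is an illustrative assumption; the source does not specify
    how per-step judgments are aggregated.
    """
    if not history:
        return 0.0
    return sum(judge(step) for step in history) / len(history)


def toy_step_fn(question: str,
                history: List[PerceptionStep]) -> Tuple[PerceptionStep, str]:
    """Toy stand-in for the model: walk a 30 s video in 10 s segments."""
    start = history[-1].end if history else 0.0
    end = start + 10.0
    step = PerceptionStep(start, end,
                          f"events in [{start:.0f}s, {end:.0f}s]",
                          "toy analysis")
    action = "answer" if end >= 30.0 else "continue"
    return step, action


def toy_judge(step: PerceptionStep) -> float:
    """Toy stand-in for the evaluator: treat the last segment as hallucinated."""
    return 1.0 if step.end <= 20.0 else 0.0
```

In this sketch, calling `perception_loop("What happens?", toy_step_fn)` yields three timestamped steps, and `anti_hallucination_reward` penalizes the trajectory for the one step the toy judge rejects. In an RL setting, such a score would be one component of the trajectory reward.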
