Alternating Perception-Reasoning for Hallucination-Resistant Video Understanding

Bowei Pu, Chuanbin Liu, Yifan Ge, Peicheng Zhou, Yiwei Sun, Zhiying Lu, Jiankang Wang, Hongtao Xie

TL;DR

This work introduces Perception Loop Reasoning (PLR), a looped, timestamped perception framework for video understanding that mitigates perception shortcuts and hallucinations. Central to PLR is the Factual-Aware Evaluator (FAE), which is trained on AnetHallu-117K and provides an anti-hallucination reward that enforces factual grounding in video evidence during RL with GRPO. The authors construct VideoPLR-14K for cold-start training and report state-of-the-art performance at the 3B and 7B scales with the best data efficiency among compared methods, supported by ablations and analysis of bias and reward design. The approach yields grounded, temporally aware reasoning, and the code and datasets are released for the research community.

Abstract

Sufficient visual perception is the foundation of video reasoning. Nevertheless, existing Video Reasoning LLMs suffer from perception shortcuts because they rely on a flawed single-step perception paradigm. This paradigm describes the video once and then conducts reasoning, which risks insufficient evidence and emergent hallucinations. To address these issues, we introduce a new framework that integrates a loop-based paradigm with an anti-hallucination reward. First, to address insufficient evidence, we introduce the Perception Loop Reasoning (PLR) paradigm. Instead of describing the video all at once, each loop requires the model to describe a video segment with precise timestamps, analyze this segment, and decide the next action. Second, to address the risk of hallucinations, the Factual-Aware Evaluator (FAE) evaluates each perception result and serves as a reliable anti-hallucination reward. This reward encourages the model to provide sufficient and precise video evidence. Our FAE, which performs comparably to GPT-4o, is tuned on our AnetHallu-117K, a large-scale hallucination judgment preference dataset. Extensive experiments show that our Video-PLR achieves state-of-the-art performance at both the 3B and 7B parameter scales and has the best data efficiency. Our code, models, and datasets are released at: https://github.com/BoweiPu/VideoPLR.
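
The loop described in the abstract (describe a timestamped segment, analyze it, decide the next action) can be illustrated with a minimal Python sketch. Everything here is hypothetical and is not taken from the released code: the names `Perception`, `Step`, `perception_loop`, and `factual_reward`, the `"continue"`/`"answer"` action strings, and the toy policy are all assumptions. In particular, averaging the judge's per-perception verdicts is only one plausible way to turn them into a scalar reward; the abstract does not specify the actual formula.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Perception:
    start: float       # segment start time in seconds
    end: float         # segment end time in seconds
    description: str   # what the model claims to see in this segment


@dataclass
class Step:
    perception: Perception
    analysis: str      # reasoning over the perceived segment
    action: str        # hypothetical action label: "continue" or "answer"


def perception_loop(policy: Callable[[str, List[Step]], Step],
                    question: str,
                    max_loops: int = 8) -> List[Step]:
    """Alternate perception and reasoning until the policy decides to answer."""
    trace: List[Step] = []
    for _ in range(max_loops):
        step = policy(question, trace)
        trace.append(step)
        if step.action == "answer":
            break
    return trace


def factual_reward(trace: List[Step],
                   judge: Callable[[Perception], bool]) -> float:
    """Assumed aggregation: fraction of perceptions the judge accepts as factual."""
    if not trace:
        return 0.0
    return sum(judge(s.perception) for s in trace) / len(trace)


def toy_policy(question: str, trace: List[Step]) -> Step:
    """Stand-in for the video LLM: replays two fixed segments, then answers."""
    segments = [(0.0, 5.0, "a man picks up a ball"),
                (5.0, 9.0, "he throws the ball to a dog")]
    start, end, text = segments[len(trace)]
    is_last = len(trace) == len(segments) - 1
    return Step(Perception(start, end, text),
                f"analysis of {start}-{end}s",
                "answer" if is_last else "continue")


trace = perception_loop(toy_policy, "What does the man do with the ball?")
reward = factual_reward(trace, lambda p: "dog" in p.description)
```

In this toy run the judge is a stand-in for the FAE and simply checks for a keyword; in the framework described above, that role is played by a trained evaluator that scores each perception result separately.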

Paper Structure

This paper contains 36 sections, 18 equations, 38 figures, 10 tables, 1 algorithm.

Figures (38)

  • Figure 1: Overview of VideoPLR. (a) Comparison with other reasoning paradigms. VideoPLR performs iterative perception and reasoning, and each perception execution targets the perception results of the previous reasoning. During training, each perception result can be judged individually for hallucinations and rewarded accordingly. (b) Examples of VideoPLR. The PLR paradigm perceives partial video clips multiple times, achieving repeated perception and providing correct answers. (c) Benchmark performance: comparative results on 7 benchmarks highlight the video reasoning capabilities of our model at both the 3B and 7B parameter scales.
  • Figure 2: The pipeline for hallucination preference data and cold start data. (a) The figure above shows the construction method of hallucination data. Through re-annotation, five types of hallucinated captions are generated. Then, Qwen 2.5VL-7B is used to automatically generate CoT data, which is used to obtain the Factual-Aware Evaluator. (b) The figure below shows the pipeline for building cold start data following the PLR paradigm. It mainly demonstrates the method of re-annotating NextQA, using carefully designed prompts to enable the model to generate inference data containing perceptual loops.
  • Figure 3: Perception Number
  • Figure 4: RL data type
  • Figure 6: Negative Captions
  • ...and 33 more figures