
VIDEOP2R: Video Understanding from Perception to Reasoning

Yifan Jiang, Yueying Wang, Rui Zhao, Toufiq Parag, Zhimin Chen, Zhenyu Liao, Jayakrishnan Unnikrishnan

TL;DR

VideoP2R addresses the challenge of video reasoning with a process-aware reinforcement fine-tuning framework that models perception and reasoning as separate processes. It first builds VideoP2R-CoT-162K, a process-aware CoT dataset for SFT generated by a three-step pipeline, then applies PA-GRPO, an RL algorithm that supplies separate rewards for perception and reasoning. The approach achieves state-of-the-art results on six of seven video understanding and reasoning benchmarks. Ablations confirm the value of process separation and dual rewards, and the approach mitigates issues such as think–answer mismatch. The work suggests that decoupling perception from reasoning improves accuracy and gives clearer credit assignment, an insight that may carry over to future video-language understanding systems.

Abstract

Reinforcement fine-tuning (RFT), a two-stage framework consisting of supervised fine-tuning (SFT) and reinforcement learning (RL), has shown promising results in improving the reasoning ability of large language models (LLMs). Yet extending RFT to large video language models (LVLMs) remains challenging. We propose VideoP2R, a novel process-aware video RFT framework that enhances video reasoning by modeling perception and reasoning as distinct processes. In the SFT stage, we develop a three-step pipeline to generate VideoP2R-CoT-162K, a high-quality, process-aware chain-of-thought (CoT) dataset for perception and reasoning. In the RL stage, we introduce a novel process-aware group relative policy optimization (PA-GRPO) algorithm that supplies separate rewards for perception and reasoning. Extensive experiments show that VideoP2R achieves state-of-the-art (SotA) performance on six out of seven video reasoning and understanding benchmarks. Ablation studies further confirm the effectiveness of our process-aware modeling and PA-GRPO and demonstrate that the model's perception output is information-sufficient for downstream reasoning.
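
The abstract says PA-GRPO supplies separate rewards for perception and reasoning but gives no formula. The sketch below is one plausible reading, not the authors' implementation: it applies the standard GRPO group-relative normalization (reward minus group mean, divided by group standard deviation) independently to each reward stream. The function names, the example reward values, and the idea of routing each advantage to its own output segment are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one sampled group (GRPO-style normalization)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    # eps avoids division by zero when every response gets the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]


def process_aware_advantages(perception_rewards, reasoning_rewards):
    """Compute one advantage per response for each process, independently.

    Assumption: the perception advantage would weight the perception part of
    a response and the reasoning advantage would weight the reasoning part.
    """
    if len(perception_rewards) != len(reasoning_rewards):
        raise ValueError("both reward lists must cover the same group of responses")
    return {
        "perception": group_relative_advantages(perception_rewards),
        "reasoning": group_relative_advantages(reasoning_rewards),
    }


# Hypothetical group of four sampled responses to one video question.
adv = process_aware_advantages(
    perception_rewards=[1.0, 0.0, 1.0, 0.0],
    reasoning_rewards=[1.0, 1.0, 0.0, 0.0],
)
```

In this toy group, the second response has a poor perception reward but a good reasoning reward, so it receives a negative perception advantage and a positive reasoning advantage. A single combined reward would blur those two signals, which is the credit-assignment point the TL;DR makes.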

Paper Structure

This paper contains 43 sections, 7 equations, 21 figures, 7 tables.

Figures (21)

  • Figure 1: Comparison between GRPO-based video RFT framework (process-agnostic) and VideoP2R (process-aware).
  • Figure 2: Illustration of overall VideoP2R RFT framework (left) and the three-step CoT generation pipeline (right).
  • Figure 3: Illustration of the PA-GRPO algorithm.
  • Figure 4: Effect of perception on downstream reasoning.
  • Figure 5: Success (left) and failure (right) cases of VideoP2R. Correct and incorrect statements are colored.
  • ...and 16 more figures