APPO: Attention-guided Perception Policy Optimization for Video Reasoning

Henghui Du, Chang Zhou, Xi Chen, Di Hu

TL;DR

This work proposes APPO, the Attention-guided Perception Policy Optimization algorithm, which leverages token-level dense rewards to improve a model's fine-grained perception, and demonstrates that APPO consistently outperforms GRPO and DAPO.

Abstract

Complex video reasoning depends far more on fine-grained perception than on expert-level (e.g., Ph.D.-level science) reasoning. Extensive empirical observation reveals the critical impact of perception: with perception ability held nearly fixed, upgrading the reasoning model from Qwen3-8B to OpenAI-o3 yields only a 0.7% performance improvement, whereas even a modest change in perception model scale (from 7B to 32B) boosts performance by 1.4%. This indicates that enhancing perception, rather than reasoning, is more critical for improving performance. It is therefore worthwhile to explore how to enhance perception ability through reasoning without expensive fine-grained annotation. To this end, we propose APPO, the Attention-guided Perception Policy Optimization algorithm, which leverages token-level dense rewards to improve the model's fine-grained perception. The core idea behind APPO is to optimize those tokens from different responses that primarily focus on the same crucial video frame (called intra-group perception tokens). Experimental results on diverse video benchmarks and on models of different scales (3B/7B) demonstrate that APPO consistently outperforms GRPO and DAPO (by 0.5% to 4%). We hope our work provides a promising, low-cost approach to enhancing a model's perception abilities through reasoning, serving diverse scenarios and demands.

Paper Structure

This paper contains 24 sections, 12 equations, 13 figures, 6 tables, and 1 algorithm.

Figures (13)

  • Figure 1: We present APPO, the Attention-guided Perception Policy Optimization algorithm, which enhances a model's fine-grained perception ability through reasoning. The core idea behind APPO is to optimize those tokens from different responses that primarily focus on the same crucial video frame (called intra-group perception tokens), yielding fine-grained, token-level reward signals. Left: Illustration of the APPO algorithm. The perception tokens within each group are optimized with different learning intensities. Right: Experimental results on multiple video benchmarks show that APPO improves overall performance over GRPO and DAPO.
  • Figure 2: Perception-Reasoning curves on the SEED-Bench-R1 [chen2025exploring] and Perception Test [patraucean2023perception] benchmarks, quantifying the impact of perception versus reasoning ability on overall performance. Each point on a curve represents the performance achieved by combining a specific perception ability with a specific reasoning ability. We first prompted four perception models of progressively stronger ability (Qwen2.5-VL-3/7/32B [bai2025qwen2] and Gemini-2.0-flash [comanici2025gemini]) to describe the video content in detail. Four reasoning models of varying capability (Qwen3-4/8B, Qwen3-235B-A22B-thinking [yang2025qwen3], and OpenAI-o3 [jaech2024openai]) were then used to think and answer questions based on the descriptions provided by each perception model, yielding $4 \times 4$ cross-combination results. (a) For the SEED-Bench-R1 benchmark, we evaluate on $2K$ Level-1 samples. (b) For the Perception Test benchmark, we randomly select $1K$ samples from different videos for evaluation. (c) Performance comparison of GRPO, DAPO, and our APPO on the SEED-Bench-R1 benchmark across models of different scales, demonstrating the significant improvements brought by enhanced perception.
  • Figure 3: Overview of the APPO algorithm, which consists of two core steps: attention-guided frame selection and intra-group perception token re-weighting. First, a group of $G$ responses is divided into two sets based on reward scores, and the final target frames are selected from these two sets based on attention weights. Building on these frames, tokens from different responses that focus on the same frame are assigned to one group and optimized with token-level weights.
  • Figure 4: Illustration of intra-group perception tokens. The tokens within the same group primarily focus on the same crucial video frame.
  • Figure 5: Comparison of generation entropy, gradient norm, and reward scores during training, relative to GRPO and DAPO.
  • ...and 8 more figures
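
The two steps named in the Figure 3 caption (attention-guided frame selection, then intra-group perception token re-weighting) can be illustrated with a minimal sketch. This is not the authors' implementation: the selection criterion (frames that high-reward responses attend to more than low-reward ones), the mean-reward split, the weighting rule (attention normalized to a group mean of 1), and all function names and toy numbers below are assumptions made for illustration only. Here `attn[i][t][f]` is taken to be the attention that token `t` of response `i` places on frame `f`.

```python
from collections import defaultdict
from statistics import mean


def split_by_reward(rewards):
    """Split response indices into high/low sets around the group mean (assumed rule)."""
    mu = mean(rewards)
    high = [i for i, r in enumerate(rewards) if r > mu]
    low = [i for i, r in enumerate(rewards) if r <= mu]
    return high, low


def frame_scores(attn, indices):
    """Average attention per frame over all tokens of the given responses."""
    n_frames = len(attn[0][0])
    totals = [0.0] * n_frames
    count = 0
    for i in indices:
        for token_attn in attn[i]:
            for f, a in enumerate(token_attn):
                totals[f] += a
            count += 1
    return [t / count for t in totals] if count else totals


def select_target_frames(attn, high, low, k=1):
    """Pick the k frames with the largest high-minus-low attention gap (assumed criterion)."""
    hs = frame_scores(attn, high)
    ls = frame_scores(attn, low)
    gap = [h - l for h, l in zip(hs, ls)]
    return sorted(range(len(gap)), key=lambda f: gap[f], reverse=True)[:k]


def group_perception_tokens(attn, target_frames):
    """Group (response, token) pairs by the target frame each token attends to most."""
    targets = set(target_frames)
    groups = defaultdict(list)
    for i, response in enumerate(attn):
        for t, token_attn in enumerate(response):
            top = max(range(len(token_attn)), key=lambda f: token_attn[f])
            if top in targets:
                groups[top].append((i, t))
    return dict(groups)


def token_weights(attn, groups):
    """Hypothetical token-level weights: attention on the group's frame, scaled to mean 1."""
    weights = {}
    for frame, members in groups.items():
        vals = [attn[i][t][frame] for i, t in members]
        avg = sum(vals) / len(vals)
        for (i, t), v in zip(members, vals):
            weights[(i, t)] = v / avg
    return weights


# Toy data: 3 responses, 2 tokens each, 3 video frames (made-up numbers).
attn = [
    [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]],
    [[0.2, 0.7, 0.1], [0.1, 0.1, 0.8]],
    [[0.6, 0.3, 0.1], [0.5, 0.4, 0.1]],
]
rewards = [1.0, 1.0, 0.0]

high, low = split_by_reward(rewards)
frames = select_target_frames(attn, high, low, k=2)
groups = group_perception_tokens(attn, frames)
weights = token_weights(attn, groups)
```

In a policy-gradient setting, weights of this kind would scale the per-token loss terms of the grouped tokens, which is one way to read "token-level dense rewards"; how APPO actually computes and applies them is specified in the paper, not here.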