Not All Tokens See Equally: Perception-Grounded Policy Optimization for Large Vision-Language Models

Zekai Ye, Qiming Li, Xiaocheng Feng, Ruihan Chen, Ziming Li, Haoyu Ren, Kun Chen, Dandan Tu, Bing Qin

Abstract

While Reinforcement Learning from Verifiable Rewards (RLVR) has advanced reasoning in Large Vision-Language Models (LVLMs), prevailing frameworks suffer from a foundational methodological flaw: by distributing identical advantages across all generated tokens, they dilute the learning signals essential for optimizing the critical, visually grounded steps of multimodal reasoning. To bridge this gap, we formulate \textit{Token Visual Dependency}, which quantifies the causal information gain from visual inputs via the Kullback-Leibler (KL) divergence between visual-conditioned and text-only predictive distributions. Finding that this dependency is highly sparse and semantically pivotal, we introduce Perception-Grounded Policy Optimization (PGPO), a fine-grained credit assignment framework that dynamically reshapes advantages at the token level. Through a threshold-gated, mass-conserving mechanism, PGPO amplifies learning signals for visually dependent tokens while suppressing gradient noise from linguistic priors. Extensive experiments with the Qwen2.5-VL series across seven challenging multimodal reasoning benchmarks show that PGPO improves model performance by 18.7% on average. Both theoretical and empirical analyses confirm that PGPO reduces gradient variance, prevents training collapse, and acts as a potent regularizer for robust, perception-grounded multimodal reasoning. Code will be published at https://github.com/Yzk1114/PGPO.
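The token-level quantity described above can be illustrated with a minimal sketch: run the policy twice per response, once with the image in context and once text-only, and take the KL divergence between the two per-position next-token distributions. The function name and the two-logits interface below are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def visual_dependency(logits_vis: torch.Tensor,
                      logits_txt: torch.Tensor) -> torch.Tensor:
    """Token Visual Dependency S_t = KL(p(. | image, text) || p(. | text)).

    logits_vis: [T, V] next-token logits with the image in context.
    logits_txt: [T, V] next-token logits with the image removed.
    Returns a [T] tensor of per-token dependency scores.
    """
    log_p_vis = F.log_softmax(logits_vis, dim=-1)
    log_p_txt = F.log_softmax(logits_txt, dim=-1)
    # KL(p_vis || p_txt), summed over the vocabulary at each position.
    return (log_p_vis.exp() * (log_p_vis - log_p_txt)).sum(dim=-1)
```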

Paper Structure

This paper contains 83 sections, 2 theorems, 42 equations, 8 figures, and 7 tables.

Key Result

Theorem E.1

Under Assumptions 1–4, PGPO strictly suppresses the variance contribution of nuisance tokens by a factor of $\varepsilon^2$ compared to standard GRPO, where $\varepsilon \ll 1$ represents the suppression floor of the advantage reshaping function.
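A one-line sketch of where the $\varepsilon^2$ factor comes from, assuming (consistent with the abstract) that a nuisance token's advantage is scaled down to the suppression floor $\varepsilon$; the full assumptions live in the paper's Appendix E and are not reproduced here.

```latex
% If a nuisance token t receives the suppressed advantage
% \tilde{A}_t = \varepsilon A, its policy-gradient term
% g_t = \tilde{A}_t \nabla_\theta \log \pi_\theta(y_t \mid x, y_{<t})
% contributes
\operatorname{Var}(g_t)
  = \operatorname{Var}\!\bigl(\varepsilon A \,\nabla_\theta \log \pi_\theta(y_t \mid x, y_{<t})\bigr)
  = \varepsilon^{2}\,\operatorname{Var}\!\bigl(A \,\nabla_\theta \log \pi_\theta(y_t \mid x, y_{<t})\bigr),
% an \varepsilon^{2} reduction relative to the unreshaped GRPO term.
```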

Figures (8)

  • Figure 1: Unlike standard uniform credit assignment, our proposed method dynamically allocates higher reinforcement learning advantage to pivotal tokens that heavily rely on visual perception.
  • Figure 2: Empirical analysis of visual dependency. (a) The skewed distribution of token-level visual dependency. (b–d) Average visual dependency of tokens in specific categories compared with all other tokens.
  • Figure 3: Overview of our proposed PGPO framework. The pipeline begins by quantifying token visual dependency $\mathcal{S}_t$ via KL divergence to isolate the causal information gain from visual inputs. These raw dependency signals are then transformed into a bounded token visual dependency score $I_t$ through logarithmic compression and min-max normalization. Finally, a threshold-gated mechanism dynamically reshapes the sequence-level GRPO advantage, amplifying learning signals for visually dependent tokens and suppressing modality-independent noise, while applying a sum-preserving normalization to guarantee stable policy optimization (see the code sketch after this list).
  • Figure 4: Training dynamics on Qwen2.5-VL-7B.
  • Figure 5: Word cloud of the top 200 tokens by $\mathcal{S}$ across all generated tokens on the vision-dominant MathVerse benchmark.
  • ...and 3 more figures
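A minimal sketch of the reshaping pipeline described in the Figure 3 caption: log compression, min-max normalization, a threshold gate, and a sum-preserving renormalization. The amplification form `1.0 + I`, the threshold `tau`, and the floor `eps` are illustrative assumptions; the paper's exact reshaping function is not reproduced here.

```python
import torch

def pgpo_reshape(adv: float, S: torch.Tensor,
                 tau: float = 0.5, eps: float = 0.1) -> torch.Tensor:
    """Turn one sequence-level GRPO advantage into per-token advantages.

    adv: scalar GRPO advantage shared by all T tokens of a response.
    S:   [T] raw token visual dependency scores (KL-based).
    """
    # Logarithmic compression, then min-max normalization to [0, 1].
    I = torch.log1p(S)
    I = (I - I.min()) / (I.max() - I.min() + 1e-8)
    # Threshold gate: amplify visually dependent tokens,
    # floor modality-independent ones at eps.
    w = torch.where(I >= tau, 1.0 + I, torch.full_like(I, eps))
    # Sum-preserving normalization: the total advantage mass over the
    # response stays adv * T, as under uniform credit assignment.
    w = w * (len(w) / w.sum())
    return adv * w
```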

Theorems & Definitions (6)

  • Theorem E.1: PGPO Noise Suppression
  • Proof
  • Proposition F.1: Covariance inflation due to a mean shift
  • Proof
  • Proof
  • Proof