Spotlight on Token Perception for Multimodal Reinforcement Learning

Siyuan Huang, Xiaoye Qu, Yafu Li, Yun Luo, Zefeng He, Daizong Liu, Yu Cheng

TL;DR

This work defines token perception as a metric to analyze visual grounding in multimodal RLVR for LVLMs and reveals that visual dependency is sparse across tokens while reasoning trajectories vary in perceptual grounding. It introduces Visually-Perceptive Policy Optimization (VPPO), which combines two mechanisms: Trajectory-level Advantage Shaping (TAS), which reweights each trajectory's advantage by its average visual dependency, and Token-level Gradient Filtering (TGF), which restricts gradient updates to perceptually pivotal tokens. VPPO builds on Group Relative Policy Optimization (GRPO) and uses a KL-divergence-based visual dependency measure, computed against Random Patch Blackening perturbations of the image, to quantify the information gained from vision. Across eight perception and reasoning benchmarks with 7B and 32B Qwen2.5-VL models, VPPO achieves substantial gains, faster convergence, and better stability, indicating that perception-aware signal modulation is an effective driver of multimodal reasoning performance.
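
To make the KL-based dependency measure concrete, here is a minimal Python sketch. It assumes the score of a generated token is the KL divergence between the model's next-token distribution given the original image and the distribution given the perturbed image. The direction of the divergence, the function names, and the toy distributions are assumptions for illustration and are not taken from the authors' code.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary.

    eps guards against log(0); it is a numerical convenience, not part of
    the definition.
    """
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def token_visual_dependency(probs_original, probs_masked):
    """Return one dependency score per generated token.

    probs_original[t] is the next-token distribution at step t when the
    model sees the original image; probs_masked[t] is the distribution at
    the same step when it sees the perturbed image (assumed to be given).
    """
    return [kl_divergence(p, q) for p, q in zip(probs_original, probs_masked)]


# Toy example with a 2-word vocabulary and 2 generated tokens (hypothetical values).
# Token 0 changes a lot when the image is perturbed; token 1 does not change.
with_image = [[0.9, 0.1], [0.5, 0.5]]
with_masked_image = [[0.5, 0.5], [0.5, 0.5]]
dependency = token_visual_dependency(with_image, with_masked_image)
```

In this toy case the first token receives a positive score and the second a score of zero, which matches the intuition that only tokens whose prediction shifts under the perturbation depend on the image.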

Abstract

While Reinforcement Learning with Verifiable Rewards (RLVR) has advanced the reasoning capabilities of Large Vision-Language Models (LVLMs), most existing methods in multimodal reasoning neglect the critical role of visual perception within the RLVR optimization process. In this paper, we undertake a pioneering exploration of multimodal RLVR through the novel perspective of token perception, which measures the visual dependency of each generated token. With a granular analysis of Chain-of-Thought (CoT) processes, we uncover two key insights: first, token perception in a rollout trajectory is sparsely distributed, where only a small fraction of tokens have high visual dependency for visually-grounded reasoning; second, different trajectories exhibit significant divergence in their overall visual dependency. Based on these observations, we propose Visually-Perceptive Policy Optimization (VPPO), a novel policy gradient algorithm that explicitly leverages token perception to refine the learning signal. Specifically, VPPO achieves this through a dual mechanism: it reweights a trajectory's advantage by its overall visual dependency, and focuses policy updates exclusively on perceptually pivotal tokens. On a comprehensive suite of eight perception and reasoning benchmarks, VPPO demonstrates substantial gains over leading open-source RL-tuned models, with its effectiveness consistently validated across 7B and 32B model scales. Our findings not only establish a new token-level perceptual perspective for analyzing multimodal RLVR but also present a novel and effective optimization strategy to significantly enhance the multimodal reasoning capabilities of LVLMs.

Paper Structure

This paper contains 61 sections, 2 theorems, 36 equations, 15 figures, 8 tables, 1 algorithm.

Key Result

Theorem 3.1

The variance of the VPPO estimator is approximately related to that of the GRPO estimator by the following expression:

Figures (15)

  • Figure 1: Our VPPO framework explicitly relies on token visual dependency to shape trajectory advantages and filter token gradients.
  • Figure 2: Overview of our VPPO framework. Given the original and masked image inputs, we first obtain the corresponding output distributions. Then, we compute a token-level visual dependency score for each trajectory. Subsequently, these token-level scores are used to generate two hierarchical control signals: at the macro-level, they are averaged into a trajectory-level dependency to shape the advantage, while at the micro-level, the top-$k$% tokens are identified to create a sparse binary token gradient mask. In this way, the uniform advantage is transformed into a fine-grained, targeted learning signal for the final policy update.
  • Figure 3: The skewed distribution of token-level visual dependency.
  • Figure 4: Distribution of trajectory dependency on perception.
  • Figure 5: Training dynamics for VPPO and baselines.
  • ...and 10 more figures
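
The two control signals described in the Figure 2 caption can be sketched in Python as follows. This is a sketch under stated assumptions: the caption only says that token scores are averaged into a trajectory-level dependency that shapes the advantage, and that the top-k% tokens form a binary gradient mask. The min-max mapping of trajectory dependency onto a weight range, the `low`/`high` parameters, and all function names are hypothetical choices made for illustration.

```python
import math


def trajectory_dependency(token_scores):
    """Macro-level signal: average the token-level dependency scores."""
    return sum(token_scores) / len(token_scores)


def top_k_percent_mask(token_scores, k_percent):
    """Micro-level signal: binary mask that keeps the top-k% scoring tokens."""
    n_keep = max(1, math.ceil(len(token_scores) * k_percent / 100))
    ranked = sorted(
        range(len(token_scores)), key=lambda i: token_scores[i], reverse=True
    )
    keep = set(ranked[:n_keep])
    return [1 if i in keep else 0 for i in range(len(token_scores))]


def shape_advantages(advantages, traj_deps, low=0.5, high=1.5):
    """Scale each trajectory's advantage by a weight in [low, high].

    The weight grows with the trajectory's dependency relative to the other
    trajectories in the group (hypothetical min-max mapping).
    """
    d_min, d_max = min(traj_deps), max(traj_deps)
    span = d_max - d_min
    weights = [
        (low + high) / 2 if span == 0 else low + (high - low) * (d - d_min) / span
        for d in traj_deps
    ]
    return [a * w for a, w in zip(advantages, weights)]


def token_level_signal(shaped_advantage, mask):
    """Per-token learning signal: the shaped advantage on kept tokens, 0 elsewhere."""
    return [shaped_advantage * m for m in mask]


# Toy group of two trajectories with hypothetical scores and advantages.
scores_a = [0.1, 0.9, 0.05, 0.4, 0.0]
scores_b = [0.0, 0.1, 0.0, 0.1, 0.0]
deps = [trajectory_dependency(scores_a), trajectory_dependency(scores_b)]
shaped = shape_advantages([1.0, -1.0], deps)
signal_a = token_level_signal(shaped[0], top_k_percent_mask(scores_a, 40))
```

With these toy values, only the two highest-scoring tokens of the first trajectory receive a nonzero signal, and that trajectory's advantage is weighted more heavily than the second one's because its average dependency is higher.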

Theorems & Definitions (4)

  • Definition 3.1: Token-level visual dependency
  • Theorem 3.1: Variance Reduction
  • Theorem C.1
  • proof