Rethinking Token-Level Policy Optimization for Multimodal Chain-of-Thought

Yunheng Li; Hangyi Kuang; Hengrui Zhang; Jiangxia Cao; Zhaojie Liu; Qibin Hou; Ming-Ming Cheng

Abstract

Multimodal Chain-of-Thought (CoT) reasoning requires large vision-language models to construct reasoning trajectories that interleave perceptual grounding with multi-step inference. However, existing Reinforcement Learning with Verifiable Rewards (RLVR) methods typically optimize reasoning at a coarse granularity, treating all CoT tokens uniformly without distinguishing their varying degrees of visual grounding. In this work, we conduct a token-level analysis of multimodal reasoning trajectories and show that successful reasoning is characterized by structured token dynamics reflecting both perceptual grounding and exploratory inference. Building upon this analysis, we propose Perception-Exploration Policy Optimization (PEPO), which derives a perception prior from hidden-state similarity and integrates it with token entropy through a smooth gating mechanism to produce token-level advantages. PEPO integrates seamlessly with existing RLVR frameworks such as GRPO and DAPO, requiring neither additional supervision nor auxiliary branches. Extensive experiments across diverse multimodal benchmarks, spanning geometry reasoning, visual grounding, visual puzzle solving, and few-shot classification, demonstrate consistent and robust improvements over strong RL baselines while maintaining stable training dynamics. Code: https://github.com/xzxxntxdy/PEPO
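
To make the gating idea in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes per-token visual similarity and entropy values are already available. The min-max normalization, the sigmoid gate, the `temperature` parameter, and the rescaling of weights to a mean of 1 are illustrative assumptions; the function names are hypothetical.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def min_max(values):
    """Min-max normalize to [0, 1]; a constant input maps to 0.5 (assumption)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.5] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def gated_token_weights(visual_sim, entropy, temperature=1.0):
    """Combine a perception signal and an exploration signal into token weights.

    Assumed recipe: normalize both signals, sum them, center the sum at its
    mean, pass it through a sigmoid (one possible smooth gate), then rescale
    so the average weight over the response is 1.
    """
    s = min_max(visual_sim)
    h = min_max(entropy)
    combined = [a + b for a, b in zip(s, h)]
    mean = sum(combined) / len(combined)
    gates = [1.0 / (1.0 + math.exp(-(c - mean) / temperature)) for c in combined]
    scale = len(gates) / sum(gates)  # keep the mean weight at 1 (assumption)
    return [g * scale for g in gates]


def token_level_advantages(sequence_advantage, weights):
    """Spread one sequence-level advantage over tokens using the gate weights."""
    return [sequence_advantage * w for w in weights]


# Toy response of three tokens: the second is both visually grounded and uncertain.
weights = gated_token_weights([0.1, 0.9, 0.5], [0.2, 1.5, 0.7])
advantages = token_level_advantages(2.0, weights)
```

In this sketch a token that scores high on both signals receives a weight above 1, and a token that scores low on both receives a weight below 1, so a uniform sequence-level advantage becomes non-uniform across tokens while its average is preserved.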

Paper Structure (19 sections, 12 equations, 6 figures, 13 tables, 1 algorithm)

Introduction
Related Work
Methodology
Background and Motivation
Token-Level Analysis of Multimodal Reasoning
Perception-Exploration Policy Optimization
Experiments
Experiment Setup
Main Results
Ablation Study
Qualitative Comparisons
Conclusions
Implementation Details
Prompt and Reward Design
Evaluation
...and 4 more sections

Figures (6)

Figure 1: Overview of PEPO. (a) Effective multimodal reasoning arises from the complementarity between perception and exploration. Abbr. Exp.: Exploration-only, Per.: Perception-only, P+E: Perception + Exploration. (b) Unlike traditional sequence-level optimization with uniform advantages, PEPO reweights tokens using a perception prior from visual similarity and entropy via a smooth gate, producing fine-grained token-level advantages. (c) When integrated with GRPO or DAPO, PEPO consistently improves performance across diverse benchmarks.
Figure 2: Distributions of different visual similarity metrics comparing correct and incorrect responses. (a) The global similarity ($M_{\text{glob}}$) across all tokens, where correct responses exhibit a clear rightward shift. (b) The top-$K$ similarity ($M_{\text{high}}$), where the correct-response peak also moves right. (c) The bottom-$K$ similarity ($M_{\text{low}}$), where the shift is negligible. Together, these results show that reasoning correctness is characterized by a subset of visually grounded tokens.
Figure 3: Token-level analysis of visual similarity and entropy. (a) High visual similarity tokens exhibit larger hidden-state shifts under image removal than high entropy tokens. (b) Word cloud of high entropy tokens and (c) word cloud of high visual similarity tokens, illustrating reasoning-related and perceptual terms.
Figure 4: Framework of PEPO. During response generation, the layer-wise hidden states of response tokens and vision tokens are extracted, along with the output logits. For each response token, visual similarity and entropy are computed, and the centered sum of their normalized values is passed through a smooth gating function to produce token-wise weights that modulate the advantages for PEPO updates.
Figure 5: Qualitative comparison on Geometry3K, MathVerse, and LISA datasets. The GRPO-trained model exhibits perception failures and inconsistent reasoning, leading to incorrect answers. In contrast, the PEPO-trained model generates coherent, visually grounded reasoning chains that produce correct results, demonstrating the effectiveness of PEPO in enhancing multimodal reasoning.
...and 1 more figure
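
The three summary statistics named in the Figure 2 caption can be illustrated with a short sketch. The exact definitions are not given in the captions, so the following assumes that each response token's visual similarity is its mean cosine similarity to the vision-token hidden states, and that $M_{\text{glob}}$, $M_{\text{high}}$, and $M_{\text{low}}$ are the means over all tokens, the top-$K$ tokens, and the bottom-$K$ tokens respectively. All function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def token_visual_similarity(response_states, vision_states):
    """Per-token similarity: mean cosine to every vision token (assumption)."""
    return [
        sum(cosine(r, v) for v in vision_states) / len(vision_states)
        for r in response_states
    ]


def similarity_summaries(sims, k):
    """Return assumed stand-ins for M_glob, M_high (top-K), and M_low (bottom-K)."""
    ordered = sorted(sims)
    k = max(1, min(k, len(ordered)))  # clamp K to a valid range
    return {
        "glob": sum(ordered) / len(ordered),
        "high": sum(ordered[-k:]) / k,
        "low": sum(ordered[:k]) / k,
    }


# Toy per-token similarities for one response of four tokens.
summary = similarity_summaries([0.1, 0.2, 0.3, 0.9], k=2)
toy_sims = token_visual_similarity([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0]])
```

Under these assumptions, a response with a few strongly grounded tokens raises the top-$K$ and global values while leaving the bottom-$K$ value largely unchanged, which is the pattern the caption describes for correct responses.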
