Perception-Aware Policy Optimization for Multimodal Reasoning

Zhenhailong Wang, Xuehang Guo, Sofia Stoica, Haiyang Xu, Hongru Wang, Hyeonjeong Ha, Xiusi Chen, Yangyi Chen, Ming Yan, Fei Huang, Heng Ji

TL;DR

PAPO addresses a key bottleneck in multimodal reasoning: perception failures limit end-to-end reasoning on visual inputs. It introduces an Implicit Perception Loss (KL_prcp), a KL divergence between the policy conditioned on the full visual input and the same policy conditioned on a masked visual input, and stabilizes training with Double Entropy regularization. Together these improve perception and reasoning jointly, without extra data or reward models. Empirically, PAPO yields consistent gains across eight multimodal benchmarks, with larger improvements on vision-dependent tasks and a substantial reduction in perception errors. The work also analyzes a failure mode (KL_prcp hacking) and proposes regularization that keeps learning stable, pointing toward visually grounded RLVR for large multimodal models (LMMs).

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has proven to be a highly effective strategy for endowing Large Language Models (LLMs) with robust multi-step reasoning abilities. However, its design and optimizations remain tailored to purely textual domains, resulting in suboptimal performance when applied to multimodal reasoning tasks. In particular, we observe that a major source of error in current multimodal reasoning lies in the perception of visual inputs. To address this bottleneck, we propose PAPO, a novel policy gradient algorithm that encourages the model to learn to perceive while learning to reason. Specifically, we introduce the Implicit Perception Loss in the form of a KL divergence term, which can be seamlessly plugged into mainstream RLVR algorithms such as GRPO and DAPO. Notably, PAPO does not rely on additional data curation, reward models, or stronger teacher models. To further enhance the training stability of PAPO, we introduce the Double Entropy Loss, which effectively regularizes the new KL objective without compromising performance. Despite its simplicity, PAPO yields significant overall improvements of 4.4%-17.5% on diverse multimodal benchmarks. The improvements are more pronounced, approaching 8.0%-19.1%, on tasks with high vision dependency. We also observe a substantial reduction of 30.5% in perception errors, indicating improved perceptual capabilities with PAPO. Overall, our work introduces a deeper integration of perception-aware supervision into core learning objectives and lays the groundwork for a new RL framework that encourages visually grounded reasoning. Code and data will be made publicly available for research purposes. Project page: https://mikewangwzhl.github.io/PAPO.
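The Implicit Perception Loss described above can be illustrated with a small sketch. It is not the authors' implementation. It assumes toy per-token probability distributions over a tiny vocabulary, computes the KL term over full distributions rather than sampled tokens, and assumes that both entropy terms enter the objective with a negative sign. The function names and the weights `gamma`, `eta_full`, and `eta_mask` are hypothetical.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete distributions given as probability lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def entropy(p, eps=1e-12):
    """Shannon entropy of a discrete distribution, in nats."""
    return -sum(pi * math.log(pi + eps) for pi in p)


def papo_extra_terms(full_dists, masked_dists, gamma, eta_full, eta_mask):
    """Terms added on top of a base RLVR objective (illustrative only).

    full_dists[t]   : next-token distribution at step t given the original image
    masked_dists[t] : next-token distribution at step t given the masked image
    Assumption: the objective is maximized, the KL term is rewarded, and both
    entropies are penalized. The signs and weights here are not taken from the text.
    """
    if not full_dists or len(full_dists) != len(masked_dists):
        raise ValueError("need two non-empty sequences of equal length")
    n = len(full_dists)
    kl_prcp = sum(kl_divergence(p, q) for p, q in zip(full_dists, masked_dists)) / n
    h_full = sum(entropy(p) for p in full_dists) / n
    h_mask = sum(entropy(q) for q in masked_dists) / n
    bonus = gamma * kl_prcp - eta_full * h_full - eta_mask * h_mask
    return {"kl_prcp": kl_prcp, "h_full": h_full, "h_mask": h_mask, "bonus": bonus}


# Toy usage: the policy is confident with the full image, uniform when masked.
full = [[0.9, 0.1], [0.8, 0.2]]
masked = [[0.5, 0.5], [0.5, 0.5]]
terms = papo_extra_terms(full, masked, gamma=0.02, eta_full=0.0, eta_mask=0.0)
print(terms["kl_prcp"] > 0)
```

A response that depends on the image yields a large KL term, while a response that ignores the image yields a term near zero. The sketch rewards that dependence in this way.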

Paper Structure

This paper contains 53 sections, 10 equations, 17 figures, 4 tables.

Figures (17)

  • Figure 1: Comprehensive error-type breakdown and inference example between GRPO and PAPO. We observe that perception errors account for the majority (67%) of failures in current multimodal reasoning models trained with GRPO. PAPO significantly reduces the dominant perception-driven errors by 30.5%, with the reduced portion indicated in gray. On the right, we present a representative inference example that illustrates how PAPO’s enhanced perception enables correct reasoning outcomes.
  • Figure 2: Illustration of the PAPO$_G$ objective, which extends GRPO by adding the Implicit Perception Loss (KL$_{\text{prcp}}$). Double Entropy Loss regularization ($H[\pi_\theta]$, $H[\pi_\theta^{\text{mask}}]$) can additionally be applied to improve training stability. The KL$_{\text{prcp}}$ term is maximized, which widens the divergence between the original policy $\pi_\theta$ and a corrupted policy $\pi_\theta^{\text{mask}}$ computed with a masked visual input. Intuitively, PAPO encourages the model to produce visually grounded responses while still achieving high rewards.
  • Figure 3: Comparison of the training dynamics on the accuracy reward. Solid lines indicate running averages with a stepping window size of 20. PAPO demonstrates consistently faster learning from the early stages on both GRPO and DAPO. Notably, DAPO-7B suffers from model collapse in the later stages, whereas PAPO$_D$ achieves continued improvements without collapse, highlighting the effectiveness of the proposed Double Entropy regularization. Further analysis on regularizing the DAPO baseline is presented in Appendix \ref{app:dapo_baseline_w_ent}.
  • Figure 4: Impact of masking strategy and ratio. Performance comparison of PAPO$_G$ using different approaches for constructing $I_{\text{mask}}$. Despite its simplicity, random masking empirically outperforms semantic-aware masking. A sufficiently large masking ratio (e.g., 0.6) yields stronger performance, while ratios that are too low (e.g., 0.4) or too high (e.g., 1.0) are less effective. See details in §\ref{subsec:ablation_design_choices}.
  • Figure 5: Impact of KL$_{\text{prcp}}$ loss weighting. Performance comparison on PAPO$_G$-3B using different values of $\gamma$. Increasing $\gamma$ up to 0.02 generally improves performance, while an excessively large $\gamma$, such as 0.04, leads to model collapse (see detailed discussion in §\ref{subsec:analysis_hacking}). Larger models are also more sensitive to high $\gamma$, as shown in Figure \ref{fig:impact_factors_to_collapse}.
  • ...and 12 more figures
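
The random masking strategy mentioned in the Figure 4 caption can be sketched as follows. This is a minimal illustration, not the paper's procedure. It assumes the image is already split into a flat list of patches and that masked patches are filled with a constant. The function names, the `fill` value, and the `seed` parameter are hypothetical; only the idea of random masking and the example ratio of 0.6 come from the caption.

```python
import random


def random_patch_mask(num_patches, mask_ratio=0.6, seed=None):
    """Return a list of booleans; True marks a patch selected for masking."""
    if not 0.0 <= mask_ratio <= 1.0:
        raise ValueError("mask_ratio must lie in [0, 1]")
    rng = random.Random(seed)
    num_masked = round(num_patches * mask_ratio)
    masked = set(rng.sample(range(num_patches), num_masked))
    return [i in masked for i in range(num_patches)]


def apply_mask(patches, mask, fill=0.0):
    """Replace every value of each masked patch with `fill`; copy the others."""
    if len(patches) != len(mask):
        raise ValueError("patches and mask must have the same length")
    return [[fill] * len(p) if m else list(p) for p, m in zip(patches, mask)]


# Toy usage: 10 patches of 4 pixel values each, 60% of them blanked out.
patches = [[float(i)] * 4 for i in range(1, 11)]
mask = random_patch_mask(len(patches), mask_ratio=0.6, seed=0)
masked_patches = apply_mask(patches, mask)
print(sum(mask), "of", len(mask), "patches masked")
```

In this sketch, a ratio of 1.0 removes every patch and a ratio of 0.0 leaves the image untouched, which correspond to the two extremes of the masking-ratio ablation.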