Perceptual-Evidence Anchored Reinforced Learning for Multimodal Reasoning

Chi Zhang, Haibo Qiu, Qiming Zhang, Yufei Xu, Zhixiong Zeng, Siqi Yang, Peng Shi, Lin Ma, Jing Zhang

TL;DR

PEARL tackles the core problem that RLVR-based multimodal reasoning methods verify only final text, ignoring upstream visual perception and enabling reward hacking. It introduces a dual-path reinforcement learning framework that uses a perception checklist to generate verifiable perception rewards and gates reasoning updates, resulting in a perception-grounded training loop. Through perception-oriented rollouts, gating, and a dual-objective optimization, PEARL achieves consistent gains across diverse multimodal reasoning benchmarks and scales with model size, while reducing training cost relative to strong baselines. The findings emphasize that robust, perceptually grounded reasoning is achievable with simple, task-aligned perception probes, suggesting a practical path toward more reliable vision-language reasoning systems.

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has significantly advanced the reasoning capabilities of Large Language Models (LLMs) and is now being applied to Vision-Language Models (VLMs). However, vanilla RLVR for VLMs verifies only the final textual output, neglecting the foundational step of visual perception. This oversight leads to visual hallucinations and reward hacking, because reasoning built upon flawed perception is inherently unreliable. To address this, we propose PEARL (Perceptual-Evidence Anchored Reinforced Learning), a dual-branch, perception-reasoning synergistic framework that strengthens multimodal reasoning by explicitly anchoring it to verified visual evidence. For each reasoning-oriented QA instance, PEARL first derives a perception checklist -- a set of perception-oriented sub-questions with verifiable answers that probe the model's understanding of key visual evidence. During training, auxiliary rollouts on this checklist yield a perceptual reward that both directly reinforces the model's perception ability and acts as a fidelity gate for reasoning. If the model passes the perception check, its policy update is biased towards evidence-anchored reasoning. Otherwise, the process is halted to prevent reasoning from flawed premises. PEARL can be seamlessly integrated with popular RL methods like GRPO and DAPO. Comprehensive experiments show PEARL achieves substantial gains on multimodal reasoning benchmarks, e.g., a +9.7% improvement over the baseline and +6.6% over GRPO on MathVerse.
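
The gating idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `ChecklistItem`, `perception_score`, and `gate_reasoning_rewards`, the exact-match scoring rule, and the `threshold` value are all assumptions made for illustration, since the abstract does not specify how the perception check is scored or what counts as passing.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence


@dataclass
class ChecklistItem:
    """One perception sub-question with a verifiable answer (hypothetical type)."""
    question: str
    answer: str


def perception_score(items: Sequence[ChecklistItem],
                     answer_fn: Callable[[str], str]) -> float:
    """Fraction of checklist items the model answers correctly.

    Assumption: correctness is a case-insensitive exact match. The paper
    may use a different verifier.
    """
    if not items:
        return 0.0
    correct = sum(
        answer_fn(item.question).strip().lower() == item.answer.strip().lower()
        for item in items
    )
    return correct / len(items)


def gate_reasoning_rewards(outcome_rewards: Sequence[float],
                           score: float,
                           threshold: float = 0.5) -> Optional[List[float]]:
    """Pass reasoning rewards through only if the perception check is passed.

    Assumption: "passing" means score >= threshold. Returning None stands
    for halting the reasoning update for this instance.
    """
    if score < threshold:
        return None
    return list(outcome_rewards)
```

In this sketch the perception score serves two roles, mirroring the abstract: it can be used directly as a reward for the perception rollouts, and it decides whether the outcome rewards of the reasoning rollouts are forwarded to the underlying RL method (for example GRPO or DAPO) at all.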

Paper Structure

This paper contains 31 sections, 6 equations, 9 figures, 8 tables.

Figures (9)

  • Figure 1: We present PEARL, a perception–reasoning synergistic RL paradigm that strengthens VLMs' reasoning by explicitly anchoring it in perceptual evidence and demonstrates outstanding performance across multiple multimodal reasoning benchmarks.
  • Figure 2: Comparison of training dynamics and failure modes. Standard outcome-based GRPO reduces reasoning errors but fails to address fundamental perception errors, leading to spurious reasoning chains based on flawed visual understanding. In contrast, our proposed PEARL achieves simultaneous improvements in both perception and reasoning, leading to a marked reduction in both error types and enabling more reliable problem-solving.
  • Figure 3: Representative sub-question/answer pairs produced under our guidelines from original image–QA instances. These task-aligned, image-grounded items provide high-quality, easily verifiable supervision and serve as perceptual scaffolds supporting reasoning.
  • Figure 4: Training cost comparison against DAPO (Yu et al., 2025). Experiments were conducted on 16 NVIDIA H800 GPUs.
  • Figure 5: Impact of Perception on Reasoning Success. (a) Samples with high perception scores exhibit significantly higher reasoning accuracy compared to low-perception regimes. (b)-(c) Visualization of the conditional probability $\mathbb{P}(\text{Reasoning}=1 \mid \text{Perception})$ for the Base model and PEARL, respectively. Both exhibit a strong positive correlation, confirming that accurate perception is a strong predictor of reasoning success.
  • ...and 4 more figures
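
The conditional probability $\mathbb{P}(\text{Reasoning}=1 \mid \text{Perception})$ in the Figure 5 caption can be estimated empirically by grouping samples by perception score and measuring reasoning accuracy within each group. The sketch below is one way to do that, not the authors' analysis code; the function name `reasoning_accuracy_by_perception`, the equal-width binning, and the assumption that perception scores lie in [0, 1] are all illustrative choices.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def reasoning_accuracy_by_perception(
    samples: Iterable[Tuple[float, bool]],
    num_bins: int = 2,
) -> Dict[int, float]:
    """Estimate P(reasoning correct | perception bin).

    Each sample is (perception_score, reasoning_correct), with the score
    assumed to lie in [0, 1]. Scores are split into equal-width bins and
    a score of exactly 1.0 falls into the last bin.
    """
    totals: Dict[int, int] = defaultdict(int)
    hits: Dict[int, int] = defaultdict(int)
    for score, correct in samples:
        bin_index = min(int(score * num_bins), num_bins - 1)
        totals[bin_index] += 1
        hits[bin_index] += int(correct)
    # Only bins that received at least one sample appear in the result.
    return {b: hits[b] / totals[b] for b in sorted(totals)}
```

A rising accuracy from the lowest to the highest bin would correspond to the positive correlation described in the caption.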