
Seeing with You: Perception-Reasoning Coevolution for Multimodal Reasoning

Ziqi Miao, Haonan Jia, Lijun Li, Chen Qian, Yuan Xiong, Wenting Yan, Jing Shao

Abstract

Reinforcement learning with verifiable rewards (RLVR) has substantially enhanced the reasoning capabilities of multimodal large language models (MLLMs). However, existing RLVR approaches typically rely on outcome-driven optimization that updates both perception and reasoning using a shared reward based solely on the final answer. This shared reward blurs credit assignment, frequently improving reasoning patterns while failing to reliably enhance the accuracy of upstream visual evidence extraction. To address this perception bottleneck, we introduce PRCO (Perception-Reasoning Coevolution), a dual-role RLVR framework with a shared policy. PRCO consists of two cooperative roles: an Observer that generates an evidence caption tailored to the question and a Solver that predicts the final answer based on this caption. Crucially, PRCO employs role-specific reward signals: the Solver is optimized using verifiable outcome rewards on the final answer, while the Observer receives a utility reward derived from the Solver's downstream success. Extensive experiments across eight challenging multimodal reasoning benchmarks demonstrate that PRCO yields consistent improvements across model scales, exceeding the base model by over 7 points in average accuracy and outperforming prior open-source RL-tuned baselines.
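To make the dual-role setup concrete, the sketch below shows how role-specific rewards could be assigned under a shared policy. It is a minimal illustration under assumed names: generate_caption and generate_answer are hypothetical helpers standing in for Observer-role and Solver-role decoding, and the exact-match check stands in for a verifiable outcome reward; this is not the authors' implementation.

    # Minimal sketch of PRCO-style role-specific reward assignment.
    # `policy.generate_caption` and `policy.generate_answer` are hypothetical
    # helpers for the Observer and Solver roles of a shared policy.

    def prco_rollout(policy, image, question, answer_key, n_solver_samples=4):
        # Observer: produce a question-conditioned evidence caption from the image.
        caption = policy.generate_caption(image, question)

        # Solver: reason over the caption (not the raw image) to predict answers.
        solver_rewards = []
        for _ in range(n_solver_samples):
            prediction = policy.generate_answer(caption, question)
            # Verifiable outcome reward on the final answer (exact match here).
            solver_rewards.append(1.0 if prediction == answer_key else 0.0)

        # Observer utility reward: how useful the caption was downstream,
        # measured by the Solver's average success given that caption.
        observer_reward = sum(solver_rewards) / len(solver_rewards)

        return caption, solver_rewards, observer_reward

The key point the sketch captures is that the two roles share parameters but receive different learning signals: the Solver is scored directly against the verifiable answer, while the Observer is scored by how well its caption enables the Solver to succeed.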

Paper Structure

This paper contains 41 sections, 12 equations, 15 figures, and 6 tables.

Figures (15)

  • Figure 1: Diagnostic analysis of GRPO on WeMath [qiao2024wemath]. Left: GRPO reduces reasoning errors much more than perception errors. Right: a representative failure case caused by incorrect perception.
  • Figure 2: Overview of PRCO. A shared policy alternates between an Observer for question-conditioned evidence captioning and a Solver for evidence-conditioned reasoning. The two roles are jointly optimized with role-specific learning signals and group-relative advantages (see the sketch after this list), enabling perception–reasoning coevolution under a shared policy.
  • Figure 3: Training reward curves of PRCO and its role-ablation variants with Qwen2.5-VL-3B and Qwen2.5-VL-7B as backbones.
  • Figure 4: Pass@$k$ comparison on WeMath and MMStar for PRCO-7B, DAPO-7B, and VPPO-7B under different inference-time sampling budgets.
  • Figure 5: Error category analysis on WeMath and MathVista. Compared with Qwen2.5-VL-7B, PRCO-7B reduces both perception and reasoning errors. For presentation clarity, Knowledge and Extraction errors are grouped into the Other category.
  • ...and 10 more figures
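The Figure 2 caption mentions group-relative advantages. The snippet below is a minimal illustration of GRPO-style group normalization, where each rollout's reward is standardized against the mean and standard deviation of its group; it is a generic sketch, not the paper's exact advantage formulation.

    import statistics

    def group_relative_advantages(rewards, eps=1e-6):
        """Standardize a group of rollout rewards (zero mean, unit std),
        as in GRPO-style group-relative advantage estimation."""
        mean = statistics.mean(rewards)
        std = statistics.pstdev(rewards)
        return [(r - mean) / (std + eps) for r in rewards]

    # Example: four Solver rollouts sharing one Observer caption,
    # three correct and one incorrect.
    print(group_relative_advantages([1.0, 0.0, 1.0, 1.0]))

Because advantages are computed within each group, rollouts are rewarded for doing better than their peers on the same input rather than against an absolute baseline, which keeps the learning signal well-scaled across easy and hard questions.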