PROPA: Toward Process-level Optimization in Visual Reasoning via Reinforcement Learning

Yanbei Jiang, Chao Lei, Yihao Ding, Krista Ehinger, Jey Han Lau

TL;DR

This work tackles the difficulty of multi-step visual reasoning in Vision-Language Models by introducing PROPA, a framework that couples Monte Carlo Tree Search with Group Relative Policy Optimization to produce dense, process-level rewards and optimize reasoning at intermediate steps. It further mitigates the cold-start problem by interleaving GRPO and SFT training, and it introduces a Process Reward Model to guide test-time search, aligning inference with the training signal. Across seven benchmarks and four backbones, PROPA delivers consistent improvements over SFT- and RL-based baselines, with gains of up to 21.0% on out-of-domain tasks, demonstrating stronger reasoning and generalization. The study provides extensive ablations, an analysis of data transitions during training, and case studies, and releases its code for reproducibility.

Abstract

Despite significant progress, Vision-Language Models (VLMs) still struggle with complex visual reasoning, where multi-step dependencies cause early errors to cascade through the reasoning chain. Existing post-training paradigms are limited: Supervised Fine-Tuning (SFT) relies on costly step-level annotations, while Reinforcement Learning with Verifiable Rewards (RLVR) methods like GRPO provide only sparse, outcome-level feedback, hindering stable optimization. We introduce PROPA (Process-level Reasoning Optimization with interleaved Policy Alignment), a novel framework that integrates Monte Carlo Tree Search (MCTS) with GRPO to generate dense, process-level rewards and optimize reasoning at each intermediate step without human annotations. To overcome the cold-start problem, PROPA interleaves GRPO updates with SFT, enabling the model to learn from both successful and failed reasoning trajectories. A Process Reward Model (PRM) is further trained to guide inference-time search, aligning the test-time search with the training signal. Across seven benchmarks and four VLM backbones, PROPA consistently outperforms both SFT- and RLVR-based baselines. It achieves gains of up to 17.0% on in-domain tasks and 21.0% on out-of-domain tasks over existing state-of-the-art methods, establishing strong reasoning and generalization capability for visual reasoning tasks. The code is available at: https://github.com/YanbeiJiang/PROPA.
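
To illustrate why dense, step-level rewards can help where outcome-level rewards stall, the sketch below applies the standard GRPO group-relative normalization (reward minus group mean, divided by group standard deviation) to two groups of rewards. The function name, the epsilon term, and the numeric step scores are illustrative assumptions, not code or values from the paper. In PROPA the step-level scores would come from MCTS, whose exact reward definition is not given in this abstract.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group: (r - mean) / (std + eps).

    eps is an assumed small constant that avoids division by zero when
    every reward in the group is identical.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Outcome-level feedback on a hard question: every sampled response is wrong,
# so all rewards are equal and every advantage is 0.0 (no learning signal).
outcome_rewards = [0.0, 0.0, 0.0, 0.0]
print(group_relative_advantages(outcome_rewards))

# Process-level feedback: hypothetical scores for four candidate next steps
# from the same partial reasoning chain (stand-ins for search-based estimates).
# The scores differ, so the candidates receive distinct advantages.
step_scores = [0.75, 0.25, 0.5, 0.0]
print(group_relative_advantages(step_scores))
```

Under these assumptions, the first group yields no gradient signal at all, while the second ranks the candidate steps relative to one another. This is one way to read the abstract's contrast between sparse outcome-level feedback and dense process-level rewards.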

Paper Structure

This paper contains 30 sections, 5 equations, 5 figures, 10 tables, 1 algorithm.

Figures (5)

  • Figure 1: Example outputs of our PROPA framework compared with RFT (Reinforcement Fine-Tuning using GRPO).
  • Figure 2: Overview of the proposed PROPA framework. The architecture integrates MCTS-guided process-level reward generation, an interleaved GRPO training scheme, and a learned PRM for test-time inference.
  • Figure 3: Accuracy over epochs across domains and baselines. The gray region corresponds to the SFT-activation stage, while the blue region represents the training stage.
  • Figure 4: Transition of GRPO and SFT data proportions over training steps for Qwen2.5-VL-3B (top row) and Intern2.5-VL-2B (bottom row) across all ID datasets. The x-axis shows the training steps, the left y-axis shows the number of SFT instances, and the right y-axis shows the number of GRPO instances.
  • Figure 5: Example visualization of datasets used in our benchmark. We include three reasoning categories: (a) Mathematical and Scientific Reasoning, (b) Spatial Reasoning, and (c) Structure Perception. Each category contains both in-domain and out-of-domain datasets, highlighting the diversity and reasoning complexity across tasks.