PROPA: Toward Process-level Optimization in Visual Reasoning via Reinforcement Learning
Yanbei Jiang, Chao Lei, Yihao Ding, Krista Ehinger, Jey Han Lau
TL;DR
This work tackles the difficulty of multi-step visual reasoning in Vision-Language Models by introducing PROPA, a framework that couples Monte Carlo Tree Search with Group Relative Policy Optimization to produce dense, process-level rewards and optimize reasoning at intermediate steps. It further mitigates the cold-start problem by interleaving GRPO and SFT training, and introduces a Process Reward Model to guide test-time search, aligning inference with the training signal. Across seven benchmarks and four backbones, PROPA delivers consistent improvements over SFT- and RL-based baselines, with gains of up to 21.0% on out-of-domain tasks, demonstrating stronger reasoning and generalization. The study provides extensive ablations, an analysis of data transitions during training, and case studies, and releases its code for reproducibility.
Abstract
Despite significant progress, Vision-Language Models (VLMs) still struggle with complex visual reasoning, where multi-step dependencies cause early errors to cascade through the reasoning chain. Existing post-training paradigms are limited: Supervised Fine-Tuning (SFT) relies on costly step-level annotations, while Reinforcement Learning with Verifiable Rewards (RLVR) methods like GRPO provide only sparse, outcome-level feedback, hindering stable optimization. We introduce PROPA (Process-level Reasoning Optimization with interleaved Policy Alignment), a novel framework that integrates Monte Carlo Tree Search (MCTS) with GRPO to generate dense, process-level rewards and optimize reasoning at each intermediate step without human annotations. To overcome the cold-start problem, PROPA interleaves GRPO updates with SFT, enabling the model to learn from both successful and failed reasoning trajectories. A Process Reward Model (PRM) is further trained to guide inference-time search, aligning the test-time search with the training signal. Across seven benchmarks and four VLM backbones, PROPA consistently outperforms both SFT- and RLVR-based baselines. It achieves gains of up to 17.0% on in-domain tasks and 21.0% on out-of-domain tasks compared to the existing state of the art, demonstrating strong reasoning and generalization capability on visual reasoning tasks. The code is available at: https://github.com/YanbeiJiang/PROPA.
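To make the idea of dense, process-level rewards concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not give the exact reward formulation, so the sketch assumes that each candidate reasoning step in the search tree is scored by the fraction of rollouts through it that reach a correct final answer, and that sibling steps sharing a parent form one group for GRPO-style normalization, (r - mean) / (std + eps). The function names `step_value` and `group_relative_advantages` and the rollout data are hypothetical.

```python
from statistics import mean, pstdev


def step_value(rollout_outcomes):
    """Monte Carlo value of one reasoning step (assumed scoring rule):
    the fraction of rollouts through this step that ended in a correct answer."""
    return sum(rollout_outcomes) / len(rollout_outcomes)


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style normalization within one group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical data: three candidate next steps expanded from the same parent
# node, each with four rollouts (1 = correct final answer, 0 = incorrect).
sibling_rollouts = [
    [1, 1, 0, 1],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
]

# Dense, step-level rewards derived only from outcome verification.
values = [step_value(outcomes) for outcomes in sibling_rollouts]

# Advantages are relative to the sibling group, so they sum to roughly zero.
advantages = group_relative_advantages(values)
```

Under these assumptions, a step whose rollouts succeed more often than its siblings receives a positive advantage even when no complete trajectory is annotated by a human, which is the sense in which outcome-level verification can be turned into step-level feedback. When all siblings score identically, every advantage is zero and the group contributes no gradient signal.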
