Perception Before Reasoning: Two-Stage Reinforcement Learning for Visual Reasoning in Vision-Language Models

Yan Chen, Long Li, Teng Xi, Long Zeng, Jingdong Wang

TL;DR

PeBR-R1 introduces a two-stage reinforcement learning framework to improve both perception and reasoning in vision-language models. The method combines dataset-level sampling to mitigate vanishing advantages, a supervised warm-up phase, and GRPO-based optimization across the perception and reasoning stages. It achieves strong improvements across seven multimodal benchmarks, outperforming several open-source baselines and rivaling some closed-source systems. The work demonstrates that decoupling perception and reasoning optimization with targeted rewards yields robust, grounded multimodal reasoning.

Abstract

Reinforcement learning (RL) has proven highly effective in eliciting the reasoning capabilities of large language models (LLMs). Inspired by this success, recent studies have explored applying similar techniques to vision-language models (VLMs), aiming to enhance their reasoning performance. However, directly transplanting RL methods from LLMs to VLMs is suboptimal, as the tasks faced by VLMs are inherently more complex. Specifically, VLMs must first accurately perceive and understand visual inputs before reasoning can be effectively performed. To address this challenge, we propose a two-stage reinforcement learning framework designed to jointly enhance both the perceptual and reasoning capabilities of VLMs. To mitigate the vanishing advantage issue commonly observed in RL training, we first perform dataset-level sampling to selectively strengthen specific capabilities using distinct data sources. During training, the first stage focuses on improving the model's visual perception through coarse- and fine-grained visual understanding, while the second stage targets the enhancement of reasoning abilities. After the proposed two-stage reinforcement learning process, we obtain PeBR-R1, a vision-language model with significantly enhanced perceptual and reasoning capabilities. Experimental results on seven benchmark datasets demonstrate the effectiveness of our approach and validate the superior performance of PeBR-R1 across diverse visual reasoning tasks.
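
The vanishing advantage issue mentioned in the abstract can be illustrated with group-normalized advantages of the kind commonly used in GRPO, which the TL;DR names as the optimizer. The sketch below assumes the usual formulation, in which each rollout's reward is standardized against the mean and standard deviation of its own rollout group. The function name `group_advantages` and the `eps` value are illustrative and not taken from the paper. When every rollout for a question receives the same reward, all advantages are zero, so that question contributes no learning signal.

```python
import statistics


def group_advantages(rewards, eps=1e-6):
    """Standardize each reward against its rollout group (assumed GRPO-style form)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population standard deviation of the group
    # eps keeps the division defined when all rewards in the group are equal
    return [(r - mean) / (std + eps) for r in rewards]


# Every rollout correct: rewards are identical, so every advantage is zero.
all_correct = group_advantages([1.0, 1.0, 1.0, 1.0])

# Mixed outcomes: correct rollouts get positive advantages, incorrect ones negative.
mixed = group_advantages([1.0, 0.0, 1.0, 0.0])
```

Under this assumption, questions the model always answers correctly or always answers incorrectly yield zero advantages, which is one way to read the motivation for sampling and categorizing questions before RL training.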

Paper Structure

This paper contains 31 sections, 14 equations, 10 figures, 8 tables, 1 algorithm.

Figures (10)

  • Figure 1: Performance comparison of PeBR-R1 with existing open-source VLMs.
  • Figure 2: (a) Overview of question sampling and categorization. Each question is passed to the vision-language model (VLM) for n independent rollouts. Based on the number of correct responses, questions are categorized into three types: Easy cases (all correct), Medium cases (partially correct), and Hard cases (all incorrect). (b) Overview of the two-stage reinforcement learning framework. The model output is parsed into four components: Image Description (I), Rationale (R), Step-by-step Thinking ($S^1$ to $S^k$), and Final Answer (A). In Stage 1: Perception RL, we use Easy cases to train the warm-up policy model by extracting image descriptions, computing rewards and advantages, and updating the policy to obtain the perception policy model. In Stage 2: Reasoning RL, we continue training the perception policy model on Medium cases using the final answer and output format to compute rewards and advantages, and optimize the policy to obtain the final PeBR-R1 model.
  • Figure 3: Perception RL Reward Computation. The keyword reward is computed by extracting keywords from the teacher-generated image description to construct a reference keyword set, then calculating the proportion of those keywords matched in the student-generated image description. The CLIP reward is obtained by evaluating the CLIP similarity between the student-generated image description and the corresponding input image, with the resulting score subsequently scaled.
  • Figure 4: Visualization of reward trends. Left: CLIP and keyword rewards during perception-stage training. Right: Accuracy and format rewards during reasoning-stage training.
  • Figure 5: Mean response length comparison with and without length penalty during Stage-1 RL.
  • ...and 5 more figures
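
The question categorization in Figure 2(a) and the keyword reward in Figure 3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the reference keyword set is assumed to be already extracted (the captions do not say how keywords are obtained), keywords are assumed to be single words, and matching is assumed to be case-insensitive at the word level. The CLIP reward is omitted because its scaling is not specified in the captions.

```python
import re


def categorize_question(rollout_correct):
    """Label a question by how many of its n rollouts were correct (Figure 2a)."""
    if not rollout_correct:
        raise ValueError("at least one rollout is required")
    n_correct = sum(bool(c) for c in rollout_correct)
    if n_correct == len(rollout_correct):
        return "easy"    # all rollouts correct
    if n_correct == 0:
        return "hard"    # all rollouts incorrect
    return "medium"      # partially correct


def keyword_reward(reference_keywords, student_description):
    """Proportion of reference keywords that appear in the student description."""
    if not reference_keywords:
        return 0.0
    # Assumption: word-level, case-insensitive matching of single-word keywords.
    tokens = set(re.findall(r"[a-z0-9]+", student_description.lower()))
    matched = sum(1 for kw in reference_keywords if kw.lower() in tokens)
    return matched / len(reference_keywords)


label = categorize_question([True, False, True, True])
reward = keyword_reward(
    ["triangle", "circle"],
    "A right triangle inscribed in a square.",
)
```

Here `label` is `"medium"` because only some rollouts are correct, and `reward` is `0.5` because one of the two reference keywords appears in the description. Word-level matching is used rather than substring matching so that a keyword such as "angle" is not counted as present merely because "triangle" occurs.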