
All Roads Lead to Rome: Incentivizing Divergent Thinking in Vision-Language Models

Xinyu Tian, Shu Zou, Zhaoyuan Yang, Mengqi He, Peter Tu, Jing Zhang

Abstract

Recent studies have demonstrated that Reinforcement Learning (RL), notably Group Relative Policy Optimization (GRPO), can intrinsically elicit and enhance the reasoning capabilities of Vision-Language Models (VLMs). However, despite this promise, the mechanisms that drive the effectiveness of RL models, as well as their limitations, remain underexplored. In this paper, we highlight a fundamental behavioral distinction between RL and base models: the former engage in deeper yet narrower reasoning, while base models, though less refined along any individual path, exhibit broader and more diverse thinking patterns. Through further analysis of training dynamics, we show that GRPO is prone to diversity collapse, causing models to prematurely converge to a limited subset of reasoning strategies while discarding the majority of potential alternatives, which leads to local optima and poor scalability. To address this, we propose Multi-Group Policy Optimization (MUPO), a simple yet effective approach designed to incentivize divergent thinking across multiple solutions, and demonstrate its effectiveness on established benchmarks. Project page: https://xytian1008.github.io/MUPO/
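
For readers unfamiliar with GRPO, the following is a minimal sketch of its group-relative advantage normalization, in which each sampled response is scored against the mean and standard deviation of the rewards in its own group. The function name and reward values are illustrative, not taken from the paper.

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantage normalization as used in GRPO: each
    sampled response is scored against the mean and standard deviation
    of the rewards in its own group (illustrative sketch)."""
    rewards = np.asarray(rewards, dtype=np.float64)
    # Guard against zero variance, which occurs when all responses
    # in a group receive the same reward.
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: 4 sampled responses to one prompt, binary correctness reward.
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))  # ~[ 1. -1. -1.  1.]
```

Because the normalization happens within a single group, responses that share the same strategy and the same reward contribute no gradient signal relative to one another, which is one intuition behind the diversity collapse discussed above.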



Figures (9)

  • Figure 1: Failure cases of RL models. We use Vision-R1 (huang2025vision) as a representative RL model, with Qwen2.5-VL-7B (bai2025qwen2) as its corresponding base model. The examples are selected from MathVerse (zhang2024mathverse) and MathVista (lu2023mathvista), respectively. For each question, we set the sampling temperature to $1.0$ and generate multiple responses, each displayed in a gray box. Main differences between the proposed reasoning strategies are annotated in blue and pink, while correct and incorrect answers are highlighted in green and red, respectively.
  • Figure 2: The impact of reasoning diversity on model performance. In (A) and (B), we report acc@k for both RL and base models on established benchmarks, with color intensity decreasing as $k=1,2,4$ (see the acc@k sketch after this list). In (C) and (D), we plot the relationship between reasoning diversity and the corresponding acc@4 scores. Each point is based on a set of $4$ responses, and a regression line is fitted to capture the overall trend.
  • Figure 3: The diversity collapse of GRPO. In (A), we plot the evolution of reasoning diversity across training steps. In (B), we present an illustration of the policy distribution over training to highlight the contrasting dynamics of convergent and divergent thinking. The gray region denotes rewards associated with different reasoning trajectories, while the blue curve indicates the corresponding sampling probabilities.
  • Figure 4: The t-SNE projection of reasoning embeddings. We analyze a successful case, where RL models produce correct answers, and a failure case, where they fail to do so despite multiple samplings.
  • Figure 5: Overview of MUPO. The upper part illustrates the high-level pipeline, in which responses are partitioned into multiple groups and the overall optimization objective is formulated as a composition of multiple GRPO objectives, one per group. The lower part presents the advantage computation for a single group, where we introduce a diversity reward to encourage inter-group separation (see the advantage sketch after this list).
  • ...and 4 more figures
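
As referenced in the Figure 2 caption, here is a small sketch of acc@k under the assumption that it is a pass@k-style metric, where a question counts as solved if any of the $k$ sampled responses is correct; the paper's exact definition may differ, and all names and data below are illustrative.

```python
import numpy as np

def acc_at_k(correct, k):
    """Assumed pass@k-style acc@k: a question counts as solved if any
    of the first k sampled responses is correct. `correct` is a
    (num_questions, num_samples) boolean array of per-response
    correctness."""
    correct = np.asarray(correct, dtype=bool)
    # Take the first k samples per question and check for any hit.
    return correct[:, :k].any(axis=1).mean()

# Example: 3 questions, 4 sampled responses each.
correct = [[0, 0, 1, 0],
           [0, 0, 0, 0],
           [1, 1, 0, 1]]
for k in (1, 2, 4):
    print(f"acc@{k} = {acc_at_k(correct, k):.2f}")
```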
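
Building on the Figure 5 caption, the following is a hypothetical sketch of MUPO-style advantage computation: responses are partitioned into groups, each group is normalized with its own GRPO-style objective, and a diversity bonus rewards inter-group separation. The `lam` weight, the `div_scores` inputs, and all names here are assumptions for illustration; the actual reward shaping and advantage formulas follow the paper's equations.

```python
import numpy as np

def mupo_advantages(rewards, group_ids, div_scores, lam=0.1, eps=1e-8):
    """Hypothetical sketch of MUPO-style advantages: responses are
    partitioned into groups, each group is normalized GRPO-style, and
    a diversity bonus encourages inter-group separation. `div_scores`
    stands in for a diversity reward derived from reasoning embeddings
    (e.g., distance to responses in other groups); `lam` is an assumed
    weighting hyperparameter."""
    rewards = np.asarray(rewards, dtype=np.float64)
    div_scores = np.asarray(div_scores, dtype=np.float64)
    # Augment the task reward with the diversity bonus.
    total = rewards + lam * div_scores
    adv = np.zeros_like(total)
    for g in set(group_ids):
        idx = np.array([i for i, gid in enumerate(group_ids) if gid == g])
        r = total[idx]
        # One GRPO-style normalization per group; the overall objective
        # is then a composition of the per-group objectives.
        adv[idx] = (r - r.mean()) / (r.std() + eps)
    return adv

# Example: 6 responses partitioned into 3 groups of 2.
rewards = [1.0, 0.0, 1.0, 1.0, 0.0, 0.0]
groups  = [0, 0, 1, 1, 2, 2]
div     = [0.8, 0.2, 0.5, 0.4, 0.9, 0.1]
print(mupo_advantages(rewards, groups, div))
```

Normalizing within each group, rather than across all sampled responses at once, keeps distinct reasoning strategies in competition with their own kind, which is the intuition behind preserving divergent thinking.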