R1-ShareVL: Incentivizing Reasoning Capability of Multimodal Large Language Models via Share-GRPO

Huanjin Yao; Qixiang Yin; Jingyi Zhang; Min Yang; Yibo Wang; Wenhao Wu; Fei Su; Li Shen; Minghui Qiu; Dacheng Tao; Jiaxing Huang

TL;DR

Share-GRPO introduces information sharing into reinforcement learning for multimodal large language models to enhance long-chain reasoning. By expanding the question space with semantically consistent transformations and sharing reasoning trajectories and rewards across variants, it mitigates sparse rewards and advantage vanishing. The approach integrates global and local hierarchical advantage estimation and shared policy optimization, achieving superior results on six reasoning benchmarks for both 7B and 32B MLLMs. Empirical findings show robust gains in mathematical and general reasoning tasks, with ablations confirming the effectiveness of each component and complementary gains when combined with dynamic sampling.

Abstract

In this work, we aim to incentivize the reasoning ability of Multimodal Large Language Models (MLLMs) via reinforcement learning (RL) and develop an effective approach that mitigates the sparse reward and advantage vanishing issues during RL. To this end, we propose Share-GRPO, a novel RL approach that tackles these issues by exploring and sharing diverse reasoning trajectories over an expanded question space. Specifically, Share-GRPO first expands the question space for a given question via data transformation techniques, then encourages the MLLM to explore diverse reasoning trajectories over the expanded question space, and shares the discovered reasoning trajectories across the expanded questions during RL. In addition, Share-GRPO shares reward information during advantage computation: it estimates solution advantages hierarchically, across and within question variants, which allows more accurate estimation of relative advantages and improves the stability of policy training. Extensive evaluations on six widely used reasoning benchmarks showcase the superior performance of our method. Code will be available at https://github.com/HJYao00/R1-ShareVL.
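
To make the idea of hierarchical advantage estimation concrete, the following is a minimal sketch and not the authors' implementation. It assumes scalar rewards grouped by question variant, a GRPO-style mean/standard-deviation normalization at both levels, and an equal-weight average of the global and local terms. The function name `hierarchical_advantages`, the `eps` constant, and the 0.5 weighting are all hypothetical choices made for illustration; the abstract does not specify how the two levels are combined.

```python
from statistics import mean, pstdev


def hierarchical_advantages(rewards_by_variant, eps=1e-6):
    """Illustrative only: combine global and local normalized advantages.

    rewards_by_variant: list of lists, one inner list of scalar rewards
    per question variant (the original question and its transformations).
    """
    # Global level: statistics over every response to every variant.
    all_rewards = [r for group in rewards_by_variant for r in group]
    g_mean, g_std = mean(all_rewards), pstdev(all_rewards)

    combined = []
    for group in rewards_by_variant:
        # Local level: statistics over responses to this variant only.
        l_mean, l_std = mean(group), pstdev(group)
        global_adv = [(r - g_mean) / (g_std + eps) for r in group]
        local_adv = [(r - l_mean) / (l_std + eps) for r in group]
        # Assumed combination rule: equal-weight average of both levels.
        combined.append([0.5 * (g + l) for g, l in zip(global_adv, local_adv)])
    return combined


# Variant 0 is answered correctly every time, variant 1 only once.
rewards = [[1.0, 1.0, 1.0], [0.0, 1.0, 0.0]]
advantages = hierarchical_advantages(rewards)
```

In this toy input, every response to the first variant receives the same reward, so a purely within-variant normalization would give those responses zero advantage. Because the global term compares them against responses to the other variant, they still receive a nonzero learning signal, which is the intuition behind sharing reward information across variants.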
