Stable and Efficient Single-Rollout RL for Multimodal Reasoning

Rui Liu; Dian Yu; Lei Ke; Haolin Liu; Yujun Zhou; Zhenwen Liang; Haitao Mi; Pratap Tokekar; Dong Yu

TL;DR

This work tackles the efficiency-stability dilemma in multimodal RLVR by introducing MSSR, a group-free single-rollout approach that uses a Beta-based baseline for Bernoulli rewards and entropy-based advantage shaping to prevent collapse. MSSR achieves comparable validation performance to strong group-based baselines with roughly half the training steps and demonstrates better generalization across five diverse multimodal reasoning benchmarks. Ablation studies show that entropy-based shaping is central to stability, outperforming alternative regularization strategies. Overall, MSSR offers a scalable, stable, and compute-efficient solution for training multimodal reasoning models with verifiable rewards.

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has become a key paradigm for improving the reasoning capabilities of Multimodal Large Language Models (MLLMs). However, prevalent group-based algorithms such as GRPO require sampling multiple rollouts for each prompt. While more efficient single-rollout variants have recently been explored in text-only settings, we find that they suffer from severe instability in multimodal contexts, often leading to training collapse. To address this trade-off between training efficiency and stability, we introduce MSSR (Multimodal Stabilized Single-Rollout), a group-free RLVR framework that achieves both stable optimization and effective multimodal reasoning performance. MSSR achieves this via an entropy-based advantage-shaping mechanism that adaptively regularizes advantage magnitudes, preventing collapse and maintaining training stability. While such mechanisms have been used in group-based RLVR, we show that in the multimodal single-rollout setting they are not merely beneficial but essential for stability. In in-distribution evaluations, MSSR demonstrates superior training compute efficiency, reaching validation accuracy similar to that of the group-based baseline with half the training steps. When trained for the same number of steps, MSSR surpasses the group-based baseline and shows consistent generalization improvements across five diverse reasoning-intensive benchmarks. Together, these results demonstrate that MSSR enables stable, compute-efficient, and effective RLVR for complex multimodal reasoning tasks.
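
To make the single-rollout idea concrete, the following is a minimal sketch of how a Beta-based baseline for Bernoulli (correct/incorrect) rewards, as mentioned in the TL;DR, might replace the group mean used by group-based methods. It is an illustration under stated assumptions, not the paper's implementation: the class name `BetaBaseline`, the uniform Beta(1, 1) prior, and the `discount` forgetting factor are all hypothetical choices that do not come from the text above.

```python
from dataclasses import dataclass


@dataclass
class BetaBaseline:
    """Per-prompt Beta(alpha, beta) belief over the probability of a correct answer.

    All names and default values here are illustrative assumptions.
    """

    alpha: float = 1.0     # pseudo-count of successes (assumed uniform prior)
    beta: float = 1.0      # pseudo-count of failures (assumed uniform prior)
    discount: float = 0.9  # hypothetical forgetting factor, since the policy changes

    def value(self) -> float:
        # Posterior mean of the success probability, used as the baseline.
        return self.alpha / (self.alpha + self.beta)

    def advantage(self, reward: float) -> float:
        # Single-rollout advantage: observed 0/1 reward minus the baseline.
        return reward - self.value()

    def update(self, reward: float) -> None:
        # Down-weight old evidence, then add the new Bernoulli observation.
        self.alpha = self.discount * self.alpha + reward
        self.beta = self.discount * self.beta + (1.0 - reward)


if __name__ == "__main__":
    baseline = BetaBaseline()
    for reward in (1.0, 1.0, 0.0):
        adv = baseline.advantage(reward)  # computed before seeing this reward
        baseline.update(reward)
        print(f"reward={reward:.0f} advantage={adv:+.3f} baseline={baseline.value():.3f}")
```

In this sketch each prompt needs only one rollout per step, because the baseline comes from the running Beta estimate rather than from the mean reward of a sampled group. The entropy-based advantage shaping described in the abstract would then act on the resulting advantage values.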
