Stable and Efficient Single-Rollout RL for Multimodal Reasoning

Rui Liu, Dian Yu, Lei Ke, Haolin Liu, Yujun Zhou, Zhenwen Liang, Haitao Mi, Pratap Tokekar, Dong Yu

TL;DR

This work tackles the efficiency-stability dilemma in multimodal RLVR by introducing MSSR, a group-free single-rollout approach that uses a Beta-based baseline for Bernoulli rewards and entropy-based advantage shaping to prevent collapse. MSSR achieves comparable validation performance to strong group-based baselines with roughly half the training steps and demonstrates better generalization across five diverse multimodal reasoning benchmarks. Ablation studies show that entropy-based shaping is central to stability, outperforming alternative regularization strategies. Overall, MSSR offers a scalable, stable, and compute-efficient solution for training multimodal reasoning models with verifiable rewards.

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has become a key paradigm to improve the reasoning capabilities of Multimodal Large Language Models (MLLMs). However, prevalent group-based algorithms such as GRPO require multi-rollout sampling for each prompt. While more efficient single-rollout variants have recently been explored in text-only settings, we find that they suffer from severe instability in multimodal contexts, often leading to training collapse. To address this training efficiency-stability trade-off, we introduce MSSR (Multimodal Stabilized Single-Rollout), a group-free RLVR framework that achieves both stable optimization and effective multimodal reasoning performance. MSSR achieves this via an entropy-based advantage-shaping mechanism that adaptively regularizes advantage magnitudes, preventing collapse and maintaining training stability. While such mechanisms have been used in group-based RLVR, we show that in the multimodal single-rollout setting they are not merely beneficial but essential for stability. In in-distribution evaluations, MSSR demonstrates superior training compute efficiency, achieving similar validation accuracy to the group-based baseline with half the training steps. When trained for the same number of steps, MSSR's performance surpasses the group-based baseline and shows consistent generalization improvements across five diverse reasoning-intensive benchmarks. Together, these results demonstrate that MSSR enables stable, compute-efficient, and effective RLVR for complex multimodal reasoning tasks.
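
To make the idea of a group-free baseline more concrete, the following is a minimal sketch of how a Beta distribution could track a per-prompt success probability when rewards are Bernoulli (0 or 1). It is not the authors' implementation: the class name `BetaBaseline`, the uniform `Beta(1, 1)` prior, and the `decay` forgetting factor are all assumptions made for illustration, since this summary does not specify them.

```python
from dataclasses import dataclass


@dataclass
class BetaBaseline:
    """Running Beta(alpha, beta) estimate of one prompt's success probability.

    Hypothetical helper: the prior and the decay factor are assumptions.
    """
    alpha: float = 1.0  # pseudo-count of successes (assumed uniform prior)
    beta: float = 1.0   # pseudo-count of failures (assumed uniform prior)
    decay: float = 1.0  # assumed forgetting factor; 1.0 keeps all past evidence

    def value(self) -> float:
        # Mean of the Beta distribution, used here as the baseline v.
        return self.alpha / (self.alpha + self.beta)

    def update(self, reward: int) -> None:
        # Conjugate Beta-Bernoulli update, optionally discounting old evidence.
        self.alpha = self.decay * self.alpha + reward
        self.beta = self.decay * self.beta + (1 - reward)


def single_rollout_advantage(reward: int, baseline: BetaBaseline) -> float:
    # Advantage of one rollout: observed reward minus the pre-update baseline.
    adv = reward - baseline.value()
    baseline.update(reward)
    return adv
```

Under these assumptions, a prompt with no history has a baseline of 0.5, so a first correct rollout receives an advantage of 0.5 and moves the baseline to 2/3. Because the baseline comes from past rollouts of the same prompt rather than from a group of simultaneous samples, one rollout per prompt per step is enough to form an advantage.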

Paper Structure

This paper contains 28 sections, 3 equations, 10 figures, 4 tables.

Figures (10)

  • Figure 1: Performance overview of MSSR. (a–b) Training and validation accuracy of MVSR (Multimodal Vanilla Single-Rollout), GRPO [shao2024deepseekmath], and our MSSR, trained on the Vision-R1-RL [huang2025vision] training set and validated on its corresponding validation set. MSSR remains stable and improves steadily, whereas MVSR is unstable and collapses. Notably, MSSR reaches a final validation accuracy similar to that of GRPO with half the training steps, highlighting its superior training compute efficiency. (c) MSSR achieves higher generalization performance than the other baselines, including GRPO [shao2024deepseekmath], RLOO [ahmadian2024back], and REINFORCE++ [hu2025reinforce++], across diverse multimodal reasoning benchmarks: MathVerse [zhang2024mathverse], MathVista [lu2023mathvista], MMK12 [meng2025mm], R1-Onevision-Bench [yang2025r1], and HallusionBench [guan2024hallusionbench]. For a fair comparison, all methods use the same total number of rollouts per step.
  • Figure 2: Overview of the proposed MSSR approach. Given a multimodal input, i.e., an image and the corresponding question, we generate a single rollout through the policy model. We then use a Beta distribution to estimate the baseline value $v$, compute the advantage $A$, and normalize it across the batch. Finally, we propose entropy-based advantage shaping to preserve entropy and stabilize training.
  • Figure 3: Model output entropy during training with Qwen2.5-VL-7B. MVSR (multimodal vanilla single-rollout) suffers from entropy collapse as training progresses, whereas our proposed MSSR (multimodal stabilized single-rollout) preserves entropy.
  • Figure 4: Ablation studies on the effectiveness of techniques for preventing entropy collapse and stabilizing multimodal single-rollout training. Cross-modal regularization: this technique provides partial stabilization and increases training accuracy, but validation accuracy still degrades, and both metrics remain below those achieved by MSSR. Entropy loss: adding an entropy loss term partially preserves entropy and improves training accuracy toward the end of training, but validation performance still degrades and entropy is not maintained as effectively as in MSSR.
  • Figure 5: Comparison of reasoning outputs from GRPO and MSSR (Multimodal Stabilized Single-Rollout). MSSR produces the correct answer while GRPO fails. We highlight the critical reasoning steps that lead to GRPO’s incorrect answer in red, and the key steps enabling MSSR’s correct prediction in green.
  • ...and 5 more figures
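
The last two steps described in the Figure 2 caption, batch normalization of the advantage and entropy-based advantage shaping, can be sketched roughly as follows. The exact shaping rule is not given in this summary, so the code assumes one plausible form: an additive entropy term clipped to a fraction of the advantage magnitude so that it cannot flip the advantage's sign. The function names and the parameters `alpha`, `kappa`, and `eps` are hypothetical.

```python
import math
from typing import List


def normalize_batch(advantages: List[float], eps: float = 1e-6) -> List[float]:
    # Standardize advantages across the batch (zero mean, unit variance).
    mean = sum(advantages) / len(advantages)
    var = sum((a - mean) ** 2 for a in advantages) / len(advantages)
    std = math.sqrt(var)
    return [(a - mean) / (std + eps) for a in advantages]


def token_entropy(probs: List[float]) -> float:
    # Shannon entropy (in nats) of one next-token distribution.
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def shape_advantage(adv: float, entropy: float,
                    alpha: float = 0.4, kappa: float = 2.0) -> float:
    # Assumed shaping rule: add an entropy bonus, capped at |adv| / kappa.
    # With kappa > 1 the bonus is smaller than |adv|, so the sign is preserved.
    bonus = min(alpha * entropy, abs(adv) / kappa)
    return adv + bonus


# Usage: shape a normalized batch with illustrative per-rollout entropies.
normalized = normalize_batch([0.5, -0.5, 0.25, -0.25])
entropies = [token_entropy([0.5, 0.5]), token_entropy([0.9, 0.1]),
             token_entropy([0.25] * 4), token_entropy([1.0])]
shaped = [shape_advantage(a, h) for a, h in zip(normalized, entropies)]
```

In this assumed form, high-entropy outputs are penalized less on failure and rewarded more on success, which is one way an advantage-level mechanism could discourage the entropy collapse shown in Figure 3. In a real training loop the entropy would be treated as a constant with respect to gradients; that detail is omitted here because the sketch uses plain floats.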