Shuffle-R1: Efficient RL framework for Multimodal Large Language Models via Data-centric Dynamic Shuffle

Linghao Zhu, Yiran Guan, Dingkang Liang, Jianzhong Ju, Zhenbo Luo, Bin Qin, Jian Luan, Yuliang Liu, Xiang Bai

TL;DR

Shuffle-R1 targets two RL training inefficiencies in multimodal LLM fine-tuning: Advantage Collapsing and Rollout Silencing. It introduces Pairwise Trajectory Sampling to create informative contrastive pairs and Advantage-based Batch Shuffle to reshape batches toward high-utility rollouts, achieving improved data efficiency with minimal overhead. Empirical results across geometry, math reasoning, and multimodal benchmarks show consistent gains over strong baselines and competitive performance against leading closed models, while requiring fewer training steps. The work highlights data-centric adaptations as a crucial lever for efficient RL in multimodal large language models.

Abstract

Reinforcement learning (RL) has emerged as an effective post-training paradigm for enhancing the reasoning capabilities of multimodal large language models (MLLMs). However, current RL pipelines often suffer from training inefficiencies caused by two underexplored issues: Advantage Collapsing, where most advantages in a batch concentrate near zero, and Rollout Silencing, where the proportion of rollouts contributing non-zero gradients diminishes over time. These issues lead to suboptimal gradient updates and hinder long-term learning efficiency. To address them, we propose Shuffle-R1, a simple yet principled framework that improves RL fine-tuning efficiency by dynamically restructuring trajectory sampling and batch composition. It introduces (1) Pairwise Trajectory Sampling, which selects high-contrast trajectories with large advantages to improve gradient signal quality, and (2) Advantage-based Trajectory Shuffle, which increases exposure of valuable rollouts through informed batch reshuffling. Experiments across multiple reasoning benchmarks show that our framework consistently outperforms strong RL baselines with minimal overhead. These results highlight the importance of data-centric adaptations for more efficient RL training in MLLMs.
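
As a rough illustration of the two issues named in the abstract, the sketch below computes group-normalized advantages in the style of GRPO (the baseline the paper's figures compare against) on a toy batch and reports two diagnostics. The function names, the tolerance `tol`, the toy rewards, and the use of a non-zero advantage as a proxy for a non-zero gradient are all assumptions made for illustration; none of them is taken from the paper.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Standardize rewards within one query's rollout group (GRPO-style, assumed)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def near_zero_fraction(advantages, tol=0.1):
    """Share of advantages with magnitude below `tol` (hypothetical collapse metric)."""
    return sum(abs(a) < tol for a in advantages) / len(advantages)


def active_rollout_ratio(advantages, tol=1e-8):
    """Share of rollouts with a non-zero advantage (proxy for a non-zero gradient)."""
    return sum(abs(a) > tol for a in advantages) / len(advantages)


# Toy batch: 3 queries, 4 rollouts each, binary correctness rewards.
batch_rewards = [
    [1.0, 1.0, 1.0, 1.0],  # all correct: every advantage is zero
    [0.0, 0.0, 0.0, 0.0],  # all wrong: every advantage is zero
    [1.0, 0.0, 0.0, 1.0],  # mixed: advantages are close to +1 or -1
]
advantages = [a for group in batch_rewards for a in group_advantages(group)]

print("near-zero fraction:", near_zero_fraction(advantages))
print("active rollout ratio:", active_rollout_ratio(advantages))
```

In this toy batch, the two groups whose rollouts all receive the same reward produce only zero advantages, so under the stated proxy just 4 of the 12 rollouts would carry a learning signal.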

Paper Structure

This paper contains 39 sections, 13 equations, 12 figures, 8 tables, 1 algorithm.

Figures (12)

  • Figure 1: (a): Advantage Collapsing, where most advantages concentrate near zero. (b): Rollout Silencing, where the ratio of rollouts with non-zero gradient consistently drops. The phenomenon gets worse in larger models.
  • Figure 2: (a) Model accuracy improves with larger rollout sizes. (b) Queries of different difficulty show varying accuracy during training; consequently, their rollouts differ in diversity and quality.
  • Figure 3: Overview of our proposed Shuffle-R1. After advantage calculation, we first conduct Pairwise Trajectory Sampling to obtain valuable trajectory pairs from the original rollout pool, then perform Advantage-based Batch Shuffle to reshape the distribution of valid trajectories in a batch.
  • Figure 4: Advantage distribution in a training batch of GRPO and our framework.
  • Figure 5: (a): Training accuracy of GRPO and Shuffle-R1. (b): Validation accuracy of GRPO and Shuffle-R1. (c): Token utilization rate of GRPO and Shuffle-R1. (d): Shuffle-R1 achieves better performance with minimal extra time cost.
  • ...and 7 more figures
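
The two stages described in the Figure 3 caption can be sketched as follows. This is one plausible reading of the caption, not the authors' implementation: pairing the i-th highest advantage with the i-th lowest, the `keep_ratio` parameter, weighting each pair by the sum of its absolute advantages, and sampling with replacement are all assumptions, and every function name here is hypothetical.

```python
import random


def pairwise_trajectory_sampling(advantages, keep_ratio=0.5):
    """Pair the k-th highest advantage with the k-th lowest, keep the top pairs.

    Returns index pairs (high, low), ordered from most to least contrastive.
    """
    order = sorted(range(len(advantages)), key=lambda i: advantages[i], reverse=True)
    n_pairs = len(order) // 2
    pairs = [(order[k], order[-1 - k]) for k in range(n_pairs)]
    n_keep = max(1, int(n_pairs * keep_ratio))
    return pairs[:n_keep]


def advantage_based_shuffle(pairs, advantages, n_samples, rng):
    """Resample pairs with probability proportional to |A_high| + |A_low|."""
    weights = [abs(advantages[i]) + abs(advantages[j]) for i, j in pairs]
    if sum(weights) == 0:
        # No pair carries any signal; fall back to the original order.
        return pairs[:n_samples]
    return rng.choices(pairs, weights=weights, k=n_samples)


# Toy rollout pool: one advantage per trajectory.
advantages = [1.2, -0.1, 0.0, -1.3, 0.4, 0.05, -0.6, 0.9]

kept = pairwise_trajectory_sampling(advantages, keep_ratio=0.5)
print("kept pairs:", kept)

rng = random.Random(0)  # fixed seed so the toy run is repeatable
batch = advantage_based_shuffle(kept, advantages, n_samples=4, rng=rng)
print("reshuffled batch:", batch)
```

Under these assumptions, near-zero trajectories (indices 1, 2, 4, 5 in the toy pool) are dropped at the pairing stage, and the resampling step then shows the higher-contrast pair more often on average.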