Shuffle-R1: Efficient RL framework for Multimodal Large Language Models via Data-centric Dynamic Shuffle

Linghao Zhu; Yiran Guan; Dingkang Liang; Jianzhong Ju; Zhenbo Luo; Bin Qin; Jian Luan; Yuliang Liu; Xiang Bai

TL;DR

Shuffle-R1 targets two RL training inefficiencies in multimodal LLM fine-tuning: Advantage Collapsing and Rollout Silencing. It introduces Pairwise Trajectory Sampling to create informative contrastive pairs and Advantage-based Batch Shuffle to reshape batches toward high-utility rollouts, achieving improved data efficiency with minimal overhead. Empirical results across geometry, math reasoning, and multimodal benchmarks show consistent gains over strong baselines and competitive performance against leading closed models, while requiring fewer training steps. The work highlights data-centric adaptations as a crucial lever for efficient RL in multimodal large language models.

Abstract

Reinforcement learning (RL) has emerged as an effective post-training paradigm for enhancing the reasoning capabilities of multimodal large language models (MLLMs). However, current RL pipelines often suffer from training inefficiencies caused by two underexplored issues: Advantage Collapsing, where most advantages in a batch concentrate near zero, and Rollout Silencing, where the proportion of rollouts contributing non-zero gradients diminishes over time. These issues lead to suboptimal gradient updates and hinder long-term learning efficiency. To address them, we propose Shuffle-R1, a simple yet principled framework that improves RL fine-tuning efficiency by dynamically restructuring trajectory sampling and batch composition. It introduces (1) Pairwise Trajectory Sampling, which selects high-contrast trajectories with large advantages to improve gradient signal quality, and (2) Advantage-based Trajectory Shuffle, which increases exposure of valuable rollouts through informed batch reshuffling. Experiments across multiple reasoning benchmarks show that our framework consistently outperforms strong RL baselines with minimal overhead. These results highlight the importance of data-centric adaptations for more efficient RL training in MLLMs.
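The two mechanisms named in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not specify how advantages are computed, how pairs are formed, or how batches are reshuffled. The sketch assumes group-normalized advantages (reward minus group mean, divided by group standard deviation), pairs the highest-advantage rollout with the lowest, and resamples pairs with probability proportional to their absolute advantage. All function names (`group_advantages`, `silent_fraction`, `pairwise_select`, `advantage_shuffle`) are hypothetical.

```python
import random
from statistics import mean, pstdev


def group_advantages(rewards):
    """Assumed advantage estimator: (r - mean) / std within one rollout group.

    When every rollout in the group gets the same reward, all advantages
    are zero, so the group contributes no gradient signal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:
        return [0.0] * len(rewards)
    return [(r - mu) / sigma for r in rewards]


def silent_fraction(groups):
    """Fraction of rollouts whose advantage is exactly zero."""
    flat = [a for group in groups for a in group]
    return sum(1 for a in flat if a == 0.0) / len(flat)


def pairwise_select(advantages, keep_pairs):
    """Pair the k-th highest advantage with the k-th lowest and keep the
    `keep_pairs` pairs with the largest advantage gap.

    Returns a list of (index_high, index_low) tuples.
    """
    order = sorted(range(len(advantages)),
                   key=lambda i: advantages[i], reverse=True)
    pairs = [(order[k], order[-1 - k]) for k in range(len(order) // 2)]
    pairs.sort(key=lambda p: advantages[p[0]] - advantages[p[1]],
               reverse=True)
    return pairs[:keep_pairs]


def advantage_shuffle(pairs, advantages, batch_size, seed=0):
    """Assumed reshuffle rule: sample pairs with replacement, weighted by
    the summed absolute advantage of each pair."""
    rng = random.Random(seed)
    weights = [abs(advantages[h]) + abs(advantages[l]) for h, l in pairs]
    if sum(weights) == 0:
        weights = None  # fall back to uniform sampling
    return rng.choices(pairs, weights=weights, k=batch_size)


# One group of 8 rollouts with binary rewards: two correct, six wrong.
adv = group_advantages([1, 0, 0, 0, 1, 0, 0, 0])
selected = pairwise_select(adv, keep_pairs=2)
batch = advantage_shuffle(selected, adv, batch_size=4)
```

In this toy group, the two retained pairs each match a correct rollout with an incorrect one, while the remaining pairs have zero advantage gap and are dropped. A group in which every rollout receives the same reward yields only zero advantages, which is the situation the abstract calls Rollout Silencing.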
