Improving Data Efficiency for LLM Reinforcement Fine-tuning Through Difficulty-targeted Online Data Selection and Rollout Replay

Yifan Sun, Jingyan Shen, Yibin Wang, Tianyu Chen, Zhendong Wang, Mingyuan Zhou, Huan Zhang

TL;DR

This paper tackles the data inefficiency of reinforcement learning fine-tuning for large language models by introducing two complementary techniques: Difficulty-targeted Online Data Selection (DOTS) and Rollout Replay (RR). DOTS uses an attention-based adaptive difficulty predictor to prioritize mid-difficulty questions, enabling faster convergence with fewer training steps, while RR reuses recent rollouts to cut per-step computation and stabilize updates with an off-policy GRPO objective. The authors provide theoretical justification that sampling near 50% success rates maximizes gradient signal and demonstrate empirically that DOTS+RR reduces total RL fine-tuning time by 23%–62% across six LLM–dataset combinations without sacrificing final performance, achieving an average 40.7% cost reduction. The approach scales to large datasets and remains effective outside math-focused domains, indicating broad applicability for data-centric RL for LLMs.

Abstract

Reinforcement learning (RL) has become an effective approach for fine-tuning large language models (LLMs), particularly to enhance their reasoning capabilities. However, RL fine-tuning remains highly resource-intensive, and existing work has largely overlooked the problem of data efficiency. In this paper, we propose two techniques to improve data efficiency in LLM RL fine-tuning: difficulty-targeted online data selection and rollout replay. We introduce the notion of adaptive difficulty to guide online data selection, prioritizing questions of moderate difficulty that are more likely to yield informative learning signals. To estimate adaptive difficulty efficiently, we develop an attention-based framework that requires rollouts for only a small reference set of questions. The adaptive difficulty of the remaining questions is then estimated based on their similarity to this set. To further reduce rollout cost, we introduce a rollout replay mechanism inspired by experience replay in traditional RL. This technique reuses recent rollouts, lowering per-step computation while maintaining stable updates. Experiments across 6 LLM-dataset combinations show that our method reduces RL fine-tuning time by 23% to 62% while reaching the same level of performance as the original GRPO algorithm. Our code is available at https://github.com/ASTRAL-Group/data-efficient-llm-rl.
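The rollout replay idea can be pictured with a small first-in-first-out buffer. The following is only a minimal sketch under stated assumptions: the class `RolloutBuffer`, the helper `build_batch`, the buffer capacity, and the uniform sampling rule are all hypothetical names and choices made for illustration, and the off-policy correction that a real training objective would need is omitted. It is not the authors' implementation.

```python
import random
from collections import deque


class RolloutBuffer:
    """Hypothetical FIFO store of recent rollouts (oldest entries are evicted first)."""

    def __init__(self, capacity):
        # deque with maxlen silently drops the oldest items once capacity is reached
        self._items = deque(maxlen=capacity)

    def add(self, rollouts):
        self._items.extend(rollouts)

    def sample(self, n, rng=random):
        # Assumption: uniform sampling without replacement from the stored rollouts
        n = min(n, len(self._items))
        return rng.sample(list(self._items), n)

    def __len__(self):
        return len(self._items)


def build_batch(fresh_rollouts, buffer, batch_size):
    """Fill a training batch with fresh rollouts plus replayed ones, then store the fresh ones."""
    n_replay = max(0, batch_size - len(fresh_rollouts))
    replayed = buffer.sample(n_replay)
    buffer.add(fresh_rollouts)  # current rollouts become available for later steps
    return list(fresh_rollouts) + replayed
```

In this sketch, only part of each batch has to be generated by the current policy, which is where the per-step saving would come from; how many rollouts to replay is left as a free parameter.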

Paper Structure

This paper contains 50 sections, 2 theorems, 23 equations, 10 figures, 8 tables, 1 algorithm.

Key Result

Theorem 1

Consider a single question $q$, where $G$ responses $\{o_i\}_{i=1}^G$ are sampled independently from the current policy $\pi_\theta(\cdot \mid q)$. Each response receives a binary reward $r_i \in \{0, 1\}$, sampled i.i.d. from a Bernoulli$(p)$ distribution, where $p$ represents the reward success rate. The gradient signal [...] is maximized when $p = 0.5$.
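The intuition behind Theorem 1 can be checked numerically. The sketch below does not reproduce the theorem's exact expression; it only uses two standard facts as stand-ins: the variance of a Bernoulli$(p)$ reward is $p(1-p)$, and a group of $G$ rewards that are all identical yields zero group-relative advantage. The function names and the group size `G = 8` are assumptions chosen for illustration.

```python
def reward_variance(p: float) -> float:
    """Variance of a single Bernoulli(p) reward."""
    return p * (1.0 - p)


def prob_mixed_group(p: float, G: int) -> float:
    """Probability that G i.i.d. Bernoulli(p) rewards are not all identical.

    A group with all-0 or all-1 rewards has zero group-relative advantage,
    so only 'mixed' groups contribute a learning signal.
    """
    return 1.0 - p**G - (1.0 - p) ** G


G = 8  # hypothetical number of responses per question
for p in (0.05, 0.25, 0.5, 0.75, 0.95):
    print(f"p={p:.2f}  var={reward_variance(p):.4f}  P(mixed)={prob_mixed_group(p, G):.4f}")
```

Both quantities are symmetric around $p = 0.5$ and peak there, which matches the claim that questions with roughly 50% success rate are the most informative.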

Figures (10)

  • Figure 1: Overview of our framework combining Difficulty-targeted Online Data Selection and Rollout Replay. At each training step, the online data selection module selects training questions with adaptive difficulty near 0.5, requiring rollouts only on a small reference set. The rollout replay module combines current rollouts with retrieved recent rollouts from a FIFO buffer, and the current rollouts are stored into the buffer for future use.
  • Figure 2: Illustration of our attention-based adaptive difficulty prediction framework. For each unlabeled question, we compute its embedding and attend to reference questions to obtain similarity scores. The predicted difficulty of the unlabeled question is obtained by computing an attention-weighted average, where similarities to reference questions serve as attention scores over their associated difficulties. In this example, the unlabeled question involves inverse trigonometric functions. The model assigns high attention to a reference question that tests a closely related concept and has a difficulty of 1.0. As a result, the predicted difficulty is also close to 1.0. All difficulty values shown correspond to adaptive difficulty scores computed at the same step.
  • Figure 3: Average accuracy curves of our method and original GRPO under various LLM–dataset combinations. The curves show average performance aggregated over four benchmarks with exponential smoothing for visualization. The error bars represent 95% confidence intervals across 3 independent runs. Although both methods are trained for the same number of steps (60), our curve is shorter in duration because RR reduces the wall-clock time per step. Our method consistently outperforms the original GRPO throughout training and reduces the time required to match the original GRPO's final accuracy after 60 training steps by an average of 40.7%.
  • Figure 4: Ratio of effective questions (i.e., questions with adaptive difficulties strictly between 0 and 1) during training across various LLM-training dataset combinations. Annotated percentages indicate the per-step increase in effective question ratio achieved by DOTS compared to original GRPO, averaged across the training process. Our adaptive prediction framework consistently selects more informative samples throughout training.
  • Figure 5: Average accuracy curves of (a) DOTS vs. Original GRPO, and (b) DOTS+RR vs. DOTS on the Qwen2.5-Math-1.5B model. The curves show average performance aggregated over four benchmarks with exponential smoothing. Note that the x-axis is the number of steps (rather than time). (a) DOTS consistently outperforms the original GRPO and leads to faster convergence. (b) Incorporating RR reduces training time by 20% while preserving the performance of DOTS.
  • ...and 5 more figures
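The attention-weighted prediction described in the Figure 2 caption, followed by the selection of questions near difficulty 0.5 from Figure 1, can be sketched as follows. This is a rough illustration under assumptions, not the authors' code: the dot-product similarity, the `temperature` parameter, the toy embeddings, and the deterministic top-k rule in `select_near_half` are all hypothetical choices, and the actual framework may differ in each of them.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def predict_difficulty(query_emb, ref_embs, ref_difficulties, temperature=1.0):
    """Attention-weighted average of reference difficulties.

    Similarities between the unlabeled question and each reference question
    act as attention scores over the references' measured difficulties.
    """
    scores = [dot(query_emb, r) / temperature for r in ref_embs]
    weights = softmax(scores)
    return sum(w * d for w, d in zip(weights, ref_difficulties))


def select_near_half(predicted, k):
    """Indices of the k questions whose predicted difficulty is closest to 0.5."""
    order = sorted(range(len(predicted)), key=lambda i: abs(predicted[i] - 0.5))
    return order[:k]
```

As in the caption's example, an unlabeled question whose embedding is much closer to a reference with difficulty 1.0 than to any other reference receives a prediction near 1.0, so it would be ranked far from the 0.5 target and skipped at that step.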

Theorems & Definitions (3)

  • Theorem 1: Maximal Gradient Signal at 50% Success Rate
  • Theorem 1: Maximal Gradient Signal at 50% Success Rate
  • proof