SPEED-RL: Faster Training of Reasoning Models via Online Curriculum Learning

Ruiqi Zhang, Daman Arora, Song Mei, Andrea Zanette

TL;DR

This paper tackles the compute bottleneck in RL-based training of large reasoning models by introducing SPEED, an online curriculum that selectively samples prompts of intermediate difficulty to maximize learning signal. The authors derive a theoretical link between prompt pass rate and gradient estimator SNR, showing mid-difficulty prompts yield the strongest learning signal across common policy-gradient algorithms. They design SPEED with a two-phase inference scheme and lightweight difficulty estimation to avoid unnecessary inference, achieving 2x–6x wall-clock speedups without sacrificing accuracy. Empirically, SPEED demonstrates robust improvements across multiple math-reasoning benchmarks and model scales, while remaining plug-and-play with existing RL algorithms and data without manual preprocessing.

Abstract

Training large language models with reinforcement learning (RL) against verifiable rewards significantly enhances their reasoning abilities, yet remains computationally expensive due to inefficient uniform prompt sampling. We introduce Selective Prompting with Efficient Estimation of Difficulty (SPEED), an adaptive online RL curriculum that selectively chooses training examples of intermediate difficulty to maximize learning efficiency. Theoretically, we establish that intermediate-difficulty prompts improve the gradient estimator's signal-to-noise ratio, accelerating convergence. Empirically, our efficient implementation leads to 2x to 6x faster training without degrading accuracy, requires no manual tuning, and integrates seamlessly into standard RL algorithms.
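
The two-phase idea described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `sample_reward` is a hypothetical stand-in for "generate one response and verify it", `n_cont` is a made-up name for the second-phase sample count, and treating a screening pass rate of exactly 0 or 1 as "too hard" or "too easy" is an assumption about how intermediate difficulty is operationalized.

```python
import random
from typing import Callable, Dict, List, Tuple


def speed_style_screen(
    prompts: List[str],
    sample_reward: Callable[[str], int],  # hypothetical: one rollout -> 0/1 reward
    n_init: int = 4,   # screening rollouts per prompt
    n_cont: int = 12,  # extra rollouts for qualified prompts (assumed name)
) -> List[Tuple[str, List[int]]]:
    """Return (prompt, rewards) pairs for prompts that survive screening."""
    kept = []
    for prompt in prompts:
        # Phase 1: a few cheap rollouts give a rough difficulty estimate.
        init_rewards = [sample_reward(prompt) for _ in range(n_init)]
        pass_rate = sum(init_rewards) / n_init
        # Assumed rule: skip prompts that look trivially easy or impossible.
        if pass_rate == 0.0 or pass_rate == 1.0:
            continue
        # Phase 2: spend the remaining rollouts only on qualified prompts.
        cont_rewards = [sample_reward(prompt) for _ in range(n_cont)]
        kept.append((prompt, init_rewards + cont_rewards))
    return kept


def make_toy_sampler(pass_rates: Dict[str, float], rng: random.Random):
    """Toy verifier: each prompt succeeds with a fixed probability."""
    def sample_reward(prompt: str) -> int:
        return int(rng.random() < pass_rates[prompt])
    return sample_reward


toy_rates = {"easy": 1.0, "medium": 0.5, "hard": 0.0}
batch = speed_style_screen(
    list(toy_rates), make_toy_sampler(toy_rates, random.Random(0))
)
for prompt, rewards in batch:
    print(prompt, len(rewards), sum(rewards) / len(rewards))
```

In this toy setup the "easy" and "hard" prompts cost only `n_init` rollouts each, which is where the inference savings would come from. A prompt of intermediate difficulty can still be dropped by chance when its screening samples all agree.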

Paper Structure

This paper contains 27 sections, 2 theorems, 41 equations, 6 figures, 1 table, 2 algorithms.

Key Result

Theorem 3.1

Fix a prompt $x$. Let $\mathcal{P}_x(\theta)$ denote the pass rate of prompt $x$ under the current policy $\pi_\theta(\cdot \mid x)$: $\mathcal{P}_x(\theta) = \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)} [\mathbb{I}(r(y) = 1)]$. The SNR of the stochastic gradient estimator (defined in eqn.snr.definitio… Moreover, for fixed $N$, we have …
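
The sketch below does not reproduce the theorem's bound. It only illustrates two elementary facts that make the pass-rate connection plausible for a binary reward: the reward variance is $p(1-p)$, which peaks at $p = 0.5$ and vanishes at $p \in \{0, 1\}$, and the probability that all $N$ sampled responses receive the same reward is $p^N + (1-p)^N$. When all rewards in a group agree, a mean-baseline estimator such as RLOO assigns zero advantage to every response. The function names are illustrative.

```python
def reward_variance(p: float) -> float:
    """Variance of a Bernoulli(p) reward."""
    return p * (1.0 - p)


def prob_all_same(p: float, n: int) -> float:
    """Probability that n i.i.d. Bernoulli(p) rewards are all 0 or all 1."""
    return p ** n + (1.0 - p) ** n


# Compare easy, hard, and intermediate prompts for a group of N = 8 responses.
for p in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(
        f"pass rate={p:.1f}  "
        f"variance={reward_variance(p):.3f}  "
        f"P(uninformative group)={prob_all_same(p, 8):.3f}"
    )
```

Prompts near either extreme almost always produce groups with identical rewards, so the rollouts spent on them contribute little or nothing to the gradient.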

Figures (6)

  • Figure 1: Left: The impact of our proposed algorithm, SPEED, on accelerating RL training of LLMs. SPEED reduces inference costs by excluding extremely easy or overly difficult prompts, focusing RL fine-tuning exclusively on moderately challenging prompts that offer the highest signal-to-noise ratio. Right: We report accuracy averaged across five benchmarks and four training configurations, comparing both variants of SPEED against baseline RL algorithms. We conduct experiments using Qwen2.5-Math-7B.
  • Figure 2: Left and middle: Pass rate distribution of 1000 samples in DAPO-17k evaluated by Qwen2.5-Math-1.5B (left) and Qwen2.5-Math-7B (middle). To evaluate the pass rates, we sample $50$ responses per prompt. Right: Average per-step inference and training times while running RLOO on the Qwen2.5-Math-7B model.
  • Figure 3: Validation accuracy on various mathematical reasoning benchmarks for the SPEED variants of RL algorithms and the base RL algorithms. Top: RLOO versus SPEED-RLOO; bottom: DAPO versus SPEED-DAPO. The initial model used is Qwen2.5-Math-7B, trained on the DeepScaleR dataset.
  • Figure 4: Average training accuracy (left) and gradient norm (right) comparison between RLOO and SPEED-RLOO during training of Qwen2.5-Math-7B. For the SPEED variants, the reported accuracies on the training set are calculated exclusively using the qualified prompts that are selected in the actual training process, showing that the SPEED variant keeps feeding data at a near optimal level of difficulty even if the model's capabilities increase.
  • Figure 5: The validation accuracy on DAPO-1k (left), the average gradient norm (middle), and the average training accuracy (right) of RLOO and SPEED-RLOO with different $N_{\mathsf{init}}$. Notice that the training accuracy (right) refers to the prompts screened for optimal pass rate. We train Qwen2.5-Math-1.5B on the training split of DAPO-17k.
  • ...and 1 more figure

Theorems & Definitions (5)

  • Theorem 3.1: Fundamental Connection between SNR and Pass Rate
  • Theorem 4.1
  • Proof 1: Proof of \ref{fact.SGD}
  • Proof 2: Proof of \ref{thm.main}
  • Proof 3: Proof of \ref{lem.SPEED.objective.maintext}