CurES: From Gradient Analysis to Efficient Curriculum Learning for Reasoning LLMs

Yongcheng Zeng, Zexu Sun, Bokai Ji, Erxue Min, Hengyi Cai, Shuaiqiang Wang, Dawei Yin, Haifeng Zhang, Xu Chen, Jun Wang

TL;DR

CurES addresses inefficiencies in RLVR-based reasoning by linking training efficiency to prompt difficulty and rollout allocation through gradient analysis. It introduces a Bayesian, low-overhead framework to estimate per-prompt accuracy and adapt both sampling and rollout budgets, guided by a Fisher-information-informed optimization and variance minimization. The approach yields a closed-form optimal prompt distribution and a principled rollout allocation rule, with Beta-Binomial updates to track prompt difficulty and mitigate distribution shift. Empirically, CurES outperforms strong baselines and converges faster across 1.5B and 7B backbones on mathematical reasoning benchmarks, demonstrating substantial gains in sample efficiency and practical impact for scalable LLM training.

Abstract

Curriculum learning plays a crucial role in enhancing the training efficiency of large language models (LLMs) on reasoning tasks. However, existing methods often fail to adequately account for variations in prompt difficulty or rely on simplistic filtering mechanisms to select prompt datasets within a narrow criterion range, resulting in significant computational waste. In this work, we approach the problem from the perspective of reinforcement learning gradient optimization, offering a systematic and theoretical investigation into how to improve the training efficiency of LLMs. We identify two key factors influencing training efficiency: the selection of training prompts and the allocation of rollout quantities across different prompts. Our theoretical analysis reveals that the sampling distribution of prompts dictates the convergence rate of gradient descent, while the allocation of the rollout quantity influences the consistency and stability of overall gradient updates. Based on these insights, we propose CurES, an efficient training method that accelerates convergence and employs Bayesian posterior estimation to minimize computational overhead. Experiments demonstrate that CurES outperforms Group Relative Policy Optimization (GRPO) by +3.30 points and +4.82 points with 1.5B and 7B models, respectively. Additionally, CurES exhibits faster convergence compared to baselines, including GRPO.
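
The Bayesian posterior estimation mentioned above is described in the TL;DR as a Beta-Binomial update of per-prompt accuracy. The following is a minimal sketch of such an update, not the authors' implementation: the class name `BetaPosterior`, the uniform Beta(1, 1) prior, and the method names are assumptions made for illustration, and the paper's handling of distribution shift is not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BetaPosterior:
    """Beta(alpha, beta) belief over one prompt's success probability.

    Hypothetical helper for illustration; the uniform prior is an assumption.
    """
    alpha: float = 1.0  # pseudo-count of correct rollouts
    beta: float = 1.0   # pseudo-count of incorrect rollouts

    def update(self, num_correct: int, num_rollouts: int) -> "BetaPosterior":
        """Conjugate Beta-Binomial update after observing a batch of rollouts."""
        if not 0 <= num_correct <= num_rollouts:
            raise ValueError("num_correct must lie in [0, num_rollouts]")
        return BetaPosterior(
            alpha=self.alpha + num_correct,
            beta=self.beta + (num_rollouts - num_correct),
        )

    @property
    def mean(self) -> float:
        """Posterior mean, usable as the estimated accuracy of the prompt."""
        return self.alpha / (self.alpha + self.beta)

    @property
    def variance(self) -> float:
        """Posterior variance; it shrinks as more rollouts are observed."""
        total = self.alpha + self.beta
        return self.alpha * self.beta / (total ** 2 * (total + 1.0))


# Usage: a small first batch gives a rough estimate, later batches refine it.
first = BetaPosterior().update(num_correct=3, num_rollouts=4)
second = first.update(num_correct=1, num_rollouts=4)
print(first.mean, second.mean, second.variance)
```

Because the update is conjugate, each refinement only adds counts to two numbers per prompt, which is one way a low-overhead estimate of this kind could be maintained across training iterations.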

Paper Structure

This paper contains 17 sections, 1 theorem, 101 equations, 10 figures, 1 table.

Key Result

Lemma 1

Let $\{p_{\theta}(x), \theta \in \Theta\}$ be a Cramér-Rao regular family with parameter space $\Theta \subset \mathbb{R}^k$, where the Fisher information matrix $I(\theta)$ is non-singular. Let $g(\theta) = (g_1(\theta), \cdots, g_s(\theta))^\top$ for $s \leq k$, and assume the partial derivatives

Figures (10)

  • Figure 1: Illustration of our theoretical and practical contributions. The first part presents our theoretical analysis, which establishes the relationship between the gradient efficiency and models’ question-answering accuracy, denoted as $p_{\theta}(x)$. Building upon these insights, we develop CurES, a practical method that initially estimates $p_{\theta}(x)$ using a small rollout quantity, then reallocates prompt sampling probabilities and rollout quantities based on the estimated accuracy. We progressively enhance the confidence of these accuracy estimates through posterior estimation. The figure further contrasts CurES with existing approaches, highlighting differences in managing prompt sampling distributions of Speed-RL (arXiv:2506.09016) and rollout quantities of GVM (arXiv:2505.02391).
  • Figure 2: Comparison of learning curves between CurES and GVM across different backbone models and advantage estimators. CurES consistently outperforms GVM under the same number of training steps, demonstrating more efficient utilization of samples.
  • Figure 3: The evolution of the estimated accuracy distributions for the Qwen2.5-Math-1.5B (left) and 7B (right) models across 15 iterations. Each violin shows the distribution of accuracy across samples: the width reflects density, the central line marks the median.
  • Figure 4: Allocation of rollout quantities with respect to accuracy in CurES at different training iterations. CurES concentrates more rollouts on moderately difficult prompts.
  • Figure 5: Performance convergence of CurES on MATH500 with different sampling configurations.
  • ...and 5 more figures

Theorems & Definitions (1)

  • Lemma 1: Cramér-Rao Inequality
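
As a worked illustration of how this inequality can bear on rollout counts (an assumption about the use case, since the lemma statement is truncated in this summary): if each rollout for a prompt $x$ is modelled as an independent Bernoulli trial with success probability $p = p_{\theta}(x)$, the standard scalar Cramér-Rao bound gives, for any unbiased estimator $\hat{p}$ built from $n$ rollouts,

$$
I(p) = \frac{1}{p(1-p)}, \qquad \operatorname{Var}(\hat{p}) \;\ge\; \frac{1}{n\, I(p)} = \frac{p(1-p)}{n}.
$$

The empirical success rate attains this bound, and the bound is largest at $p = 1/2$, which is consistent with the observation in Figure 4 that moderately difficult prompts receive more rollouts. The symbols $n$ and $\hat{p}$ are introduced here for illustration and are not taken from the paper.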