Learning to Reason with Curriculum I: Provable Benefits of Autocurriculum

Nived Rajaraman, Audrey Huang, Miro Dudik, Robert Schapire, Dylan J. Foster, Akshay Krishnamurthy

Abstract

Chain-of-thought reasoning, where language models expend additional computation by producing thinking tokens prior to final responses, has driven significant advances in model capabilities. However, training these reasoning models is extremely costly in terms of both data and compute, as it involves collecting long traces of reasoning behavior from humans or synthetic generators and further post-training the model via reinforcement learning. Are these costs fundamental, or can they be reduced through better algorithmic design? We show that autocurriculum, where the model uses its own performance to decide which problems to focus training on, provably improves upon standard training recipes for both supervised fine-tuning (SFT) and reinforcement learning (RL). For SFT, we show that autocurriculum requires exponentially fewer reasoning demonstrations than non-adaptive fine-tuning, by focusing teacher supervision on prompts where the current model struggles. For RL fine-tuning, autocurriculum decouples the computational cost from the quality of the reference model, reducing the latter to a burn-in cost that is nearly independent of the target accuracy. These improvements arise purely from adaptive data selection, drawing on classical techniques from boosting and learning from counterexamples, and requiring no assumption on the distribution or difficulty of prompts.

Paper Structure

This paper contains 60 sections, 22 theorems, 98 equations, 4 figures, 1 table, 5 algorithms.

Key Result

Proposition 3.1

Let next-token prediction, $\texttt{NTP}$, denote the CoT-supervised learning algorithm that takes a dataset $D = \{ (\mathbf{x}_i,\mathbf{y}_i) \}_{i=1}^n$ of CoTs with $\mathbf{x}_i \sim \rho$ and $\mathbf{y}_i \sim \pi^\star_{1:T} (\cdot|\mathbf{x}_i)$ and returns the model. Under the assumption on the teacher (assump:teacher), with $\ell$ as the log-loss ($\ell (\pi,a) \equiv \log (1/\pi(a))$), $\texttt{NTP}$ has sample ...

Figures (4)

  • Figure 1: An example of autocurriculum for supervised fine-tuning: the learner chooses which prompts to receive teacher CoTs on, based on its accuracy. In each iteration, the learner's model is updated to digest the new supervision and improve its accuracy.
  • Figure 2: An illustration of how models trained on the appropriate prompt distributions can correct errors of prior ones. The region marked in gray captures the region correctly labeled by the plurality of the ensemble of models. Comparing $(a)$ to $(c)$, the model $\widehat{\pi}_3$ corrects some errors made by $\widehat{\pi}_1$ (green checked region), but also introduces new errors (red crossed region). When trained under the appropriate autocurriculum, the accuracy of the ensemble improves geometrically toward $1$.
  • Figure 3: An example of autocurriculum for RL: the learner chooses which prompts to add to the training batch/dataset, based on accuracy. In each iteration, the learner's model is updated on these prompts using an RL update to improve its accuracy.
  • Figure 4: For $k=120$, we plot $\alpha^{j,k}_r$ as a function of the rank placeholder $r$ across different values of $j$ (the number of models in the current ensemble). The shaded green region captures the values of the rank $r$ (for some prompt $\mathbf{x}$) such that even if the remaining $k-j$ models were to all predict the wrong label on $\mathbf{x}$, the plurality vote remains robustly correct. The shaded red region captures the values of the rank $r$ for which, even if the remaining $k-j$ models were to all predict the correct label on $\mathbf{x}$, the plurality vote cannot be guaranteed to be correct. The shaded blue region marks values of the rank that are unattainable (the maximum rank achievable in iteration $j$ is $j$).
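
The captions above describe an ensemble whose plurality vote is checked against outcome-level correctness, with training focused on prompts where the current models struggle. The following is a minimal toy sketch of that selection step only, not the paper's algorithm: it does not reproduce the weights $\alpha^{j,k}_r$ or any model update. The names `plurality_vote`, `select_hard_prompts`, `capped_model`, and `true_sum`, and the arithmetic prompts, are all hypothetical and introduced purely for illustration.

```python
from collections import Counter
from typing import Callable, List, Sequence

Model = Callable[[str], int]           # hypothetical: prompt -> final answer
Verifier = Callable[[str, int], bool]  # hypothetical outcome verifier


def plurality_vote(models: Sequence[Model], prompt: str) -> int:
    """Most common final answer across the ensemble (ties go to the first seen)."""
    answers = [model(prompt) for model in models]
    return Counter(answers).most_common(1)[0][0]


def select_hard_prompts(
    models: Sequence[Model], prompts: Sequence[str], verifier: Verifier
) -> List[str]:
    """Prompts whose plurality answer is rejected by the outcome verifier."""
    return [x for x in prompts if not verifier(x, plurality_vote(models, x))]


def true_sum(prompt: str) -> int:
    """Ground-truth answer for toy prompts of the form 'a+b'."""
    return sum(int(term) for term in prompt.split("+"))


def capped_model(cap: int) -> Model:
    """Toy model: correct when the sum is at most `cap`, otherwise answers 0."""
    return lambda prompt: true_sum(prompt) if true_sum(prompt) <= cap else 0


# Toy setup (assumption): three weak models and an exact-match verifier.
verifier = lambda prompt, answer: answer == true_sum(prompt)
models = [capped_model(3), capped_model(0), capped_model(7)]
prompts = ["1+1", "2+5", "4+4"]

# Prompts on which the ensemble still fails; a curriculum would focus here.
hard = select_hard_prompts(models, prompts, verifier)
print(hard)
```

In this toy setup the ensemble already answers `"1+1"` correctly by plurality, so only `"2+5"` and `"4+4"` are selected for further supervision.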

Theorems & Definitions (53)

  • Definition 2.1: Chain-of-thought (CoT) distribution
  • Definition 2.2: Outcome distribution
  • Definition 2.3: Outcome verifier
  • Definition 2.4: Learning settings
  • Proposition 3.1: Corollary of foster2024behavior, joshi2025theory
  • Theorem 3.3: Exponential improvement via autocurriculum for SFT
  • proof
  • Remark 3.4: Comparison to active learning
  • Remark 3.5: The role of outcome-level accuracy
  • Theorem 3.6: Autocurriculum for SFT with general models
  • ...and 43 more