Learning to Reason with Curriculum I: Provable Benefits of Autocurriculum

Nived Rajaraman; Audrey Huang; Miro Dudik; Robert Schapire; Dylan J. Foster; Akshay Krishnamurthy

Abstract

Chain-of-thought reasoning, where language models expend additional computation by producing thinking tokens prior to final responses, has driven significant advances in model capabilities. However, training these reasoning models is extremely costly in terms of both data and compute, as it involves collecting long traces of reasoning behavior from humans or synthetic generators and further post-training the model via reinforcement learning. Are these costs fundamental, or can they be reduced through better algorithmic design? We show that autocurriculum, where the model uses its own performance to decide which problems to focus training on, provably improves upon standard training recipes for both supervised fine-tuning (SFT) and reinforcement learning (RL). For SFT, we show that autocurriculum requires exponentially fewer reasoning demonstrations than non-adaptive fine-tuning, by focusing teacher supervision on prompts where the current model struggles. For RL fine-tuning, autocurriculum decouples the computational cost from the quality of the reference model, reducing the latter to a burn-in cost that is nearly independent of the target accuracy. These improvements arise purely from adaptive data selection, drawing on classical techniques from boosting and learning from counterexamples, and requiring no assumption on the distribution or difficulty of prompts.

Paper Structure (60 sections, 22 theorems, 98 equations, 4 figures, 1 table, 5 algorithms)

Keywords.
Introduction
Contributions
Supervised fine-tuning (sec:distill).
Fine-tuning a reference model (sec:partial-coverage).
Organization
Preliminaries
Basic notation.
Language models.
Accuracy and verification.
Learning settings.
SFT: Fine-Tuning with Teacher Supervision
Prior Work: SFT without Curriculum
Reducing the Cost of Supervision: Autocurriculum
Main Result: Exponential Improvement in CoT Supervision via Autocurriculum
...and 45 more sections

Key Result

Proposition 3.1

Let next-token prediction, $\texttt{NTP}$, denote the CoT-supervised learning algorithm that takes a dataset $D = \{ (\mathbf{x}_i,\mathbf{y}_i) \}_{i=1}^n$ of CoTs, with $\mathbf{x}_i \sim \rho$ and $\mathbf{y}_i \sim \pi^\star_{1:T} (\cdot \mid \mathbf{x}_i)$, and returns the model … Under assumption assump:teacher, with $\ell$ as the log-loss ($\ell (\pi,a) \equiv \log (1/\pi(a))$), $\texttt{NTP}$ has sample
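The log-loss in the proposition can be illustrated with a small sketch. It assumes an autoregressive model, so the probability of a CoT factorizes over tokens and the sequence-level log-loss is the sum of per-token log-losses. The function names and the list-of-probabilities input are hypothetical, chosen for illustration only. The exact estimator returned by $\texttt{NTP}$ is cut off in this excerpt, so the average below is the usual empirical log-loss, not a quotation of the paper's definition.

```python
import math


def sequence_log_loss(token_probs):
    """Sum of per-token log-losses log(1/p_t) for one demonstrated CoT.

    token_probs[t] is the probability the model assigns to the t-th
    demonstrated token given the prompt and the preceding tokens
    (a hypothetical input format).
    """
    return sum(math.log(1.0 / p) for p in token_probs)


def empirical_log_loss(dataset_token_probs):
    """Average sequence log-loss over a dataset of demonstrated CoTs."""
    losses = [sequence_log_loss(probs) for probs in dataset_token_probs]
    return sum(losses) / len(losses)


if __name__ == "__main__":
    # Two toy CoTs with per-token probabilities under some model.
    data = [[0.5, 0.5], [0.25, 1.0]]
    print(empirical_log_loss(data))
```

Both toy CoTs have sequence probability $1/4$, so each contributes a loss of $\log 4$.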

Figures (4)

Figure 1: An example of autocurriculum for supervised fine-tuning: the learner chooses which prompts to receive teacher CoTs on, based on its accuracy. In each iteration, the learner's model is updated to digest the new supervision and improve its accuracy.
Figure 2: An illustration of how models trained on the appropriate prompt distributions can correct errors of prior ones. The region marked in gray captures the region correctly labeled by the plurality of the ensemble of models. Comparing $(a)$ to $(c)$, the model $\widehat{\pi}_3$ corrects some errors made by $\widehat{\pi}_1$ (green checked region), but also introduces new errors (red crossed region). When trained under the appropriate autocurriculum, the accuracy of the ensemble improves geometrically toward $1$.
Figure 3: An example of autocurriculum for RL: the learner chooses which prompts to add to the training batch/dataset, based on accuracy. In each iteration, the learner's model is updated on these prompts using an RL update to improve its accuracy.
Figure 4: For $k=120$, we plot $\alpha^{j,k}_r$ as a function of the rank placeholder $r$ across different values of $j$ (the number of models in the current ensemble). The shaded green region captures the values of the rank $r$ (for some prompt $\mathbf{x}$) such that even if the remaining $k-j$ models were all to predict the wrong label on $\mathbf{x}$, the plurality vote remains robustly correct. The shaded red region captures the values of the rank $r$ for which, even if the remaining $k-j$ models were all to predict the correct label on $\mathbf{x}$, the plurality vote cannot be guaranteed to be correct. The shaded blue region marks values of the rank that are unattainable (the maximum rank achievable in iteration $j$ is $j$).
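The loop described in the caption of Figure 1 can be sketched in a toy setting. This is a minimal illustration under stated assumptions, not the paper's algorithm: the `teacher`, `verifier`, and `TableModel` below are hypothetical stand-ins. The "model" simply memorizes supervised answers, and prompts are chosen uniformly at random among those the current model fails. The point is only to show teacher queries being spent exclusively on prompts where the learner is currently wrong.

```python
import random


def teacher(x):
    # Hypothetical teacher: returns a (chain of thought, answer) pair.
    return (f"square {x}", x * x)


def verifier(x, answer):
    # Hypothetical outcome verifier: checks only the final answer.
    return answer == x * x


class TableModel:
    """Toy learner that memorizes answers for prompts it was supervised on."""

    def __init__(self):
        self.table = {}

    def predict(self, x):
        # Unseen prompts get a default guess of 0.
        return self.table.get(x, 0)

    def update(self, demos):
        for x, (_cot, answer) in demos:
            self.table[x] = answer


def autocurriculum_sft(prompts, rounds, budget, seed=0):
    """Query the teacher only on prompts the current model gets wrong."""
    rng = random.Random(seed)
    model = TableModel()
    queries = 0
    for _ in range(rounds):
        failing = [x for x in prompts if not verifier(x, model.predict(x))]
        if not failing:
            break
        chosen = rng.sample(failing, min(budget, len(failing)))
        demos = [(x, teacher(x)) for x in chosen]
        queries += len(demos)
        model.update(demos)
    correct = sum(verifier(x, model.predict(x)) for x in prompts)
    return model, correct / len(prompts), queries


if __name__ == "__main__":
    _, acc, used = autocurriculum_sft(list(range(20)), rounds=5, budget=4)
    print(acc, used)
```

In this toy run the default guess already solves prompt `0`, so the teacher is never queried on it; a non-adaptive scheme that samples prompts without looking at the model's accuracy could spend demonstrations on prompts that are already solved.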

Theorems & Definitions (53)

Definition 2.1: Chain-of-thought (CoT) distribution
Definition 2.2: Outcome distribution
Definition 2.3: Outcome verifier
Definition 2.4: Learning settings
Proposition 3.1: Corollary of foster2024behavior, joshi2025theory
Theorem 3.3: Exponential improvement via autocurriculum for SFT
proof
Remark 3.4: Comparison to active learning
Remark 3.5: The role of outcome-level accuracy
Theorem 3.6: Autocurriculum for SFT with general models
...and 43 more
