A Task-Centric Theory for Iterative Self-Improvement with Easy-to-Hard Curricula

Chenruo Liu, Yijun Dong, Yiqiu Shen, Qi Lei

TL;DR

Adopting a task-centric view by considering reasoning tasks with multiple difficulty levels, this work proves quantifiable conditions on model initialization, task difficulty, and sample budget where easy-to-hard curricula provably achieve better guarantees than training on fixed mixtures of tasks.

Abstract

Iterative self-improvement fine-tunes an autoregressive large language model (LLM) on reward-verified outputs generated by the LLM itself. In contrast to the empirical success of self-improvement, the theoretical foundation of this generative, iterative procedure in a practical, finite-sample setting remains limited. We make progress toward this goal by modeling each round of self-improvement as maximum-likelihood fine-tuning on a reward-filtered distribution and deriving finite-sample guarantees for the expected reward. Our analysis reveals an explicit feedback loop where better models accept more data per iteration, supporting sustained self-improvement while explaining eventual saturation of such improvement. Adopting a task-centric view by considering reasoning tasks with multiple difficulty levels, we further prove quantifiable conditions on model initialization, task difficulty, and sample budget where easy-to-hard curricula provably achieve better guarantees than training on fixed mixtures of tasks. Our analyses are validated via Monte-Carlo simulations and controlled experiments on graph-based reasoning tasks.
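As a rough illustration of the feedback loop described above, the following sketch counts how many self-generated answers pass a binary reward filter in one round. It is a toy under strong assumptions, not the paper's procedure: each answer is modeled as an independent Bernoulli draw with a single hypothetical success probability `pass_rate`, and the names `accepted_count`, `n_questions`, and `m_answers` are introduced here purely for illustration.

```python
import random


def accepted_count(pass_rate, n_questions, m_answers, seed=0):
    """Count reward-verified answers in one toy round of self-generated data.

    Assumption (not from the paper): every sampled answer passes the reward
    filter independently with probability `pass_rate`.
    """
    rng = random.Random(seed)
    total = 0
    for _ in range(n_questions):
        for _ in range(m_answers):
            # An answer is kept for fine-tuning only if it passes the filter.
            if rng.random() < pass_rate:
                total += 1
    return total


# Same budgets and same seed, two hypothetical model qualities.
weak = accepted_count(0.2, n_questions=500, m_answers=4)
strong = accepted_count(0.6, n_questions=500, m_answers=4)
print(weak, strong)
```

Because both calls share a seed, they consume the same random draws, so the higher pass rate accepts at least as many answers. In this toy, the size of the filtered fine-tuning set grows with model quality, which is the qualitative loop the abstract refers to.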

Paper Structure (29 sections, 19 theorems, 406 equations, 4 figures, 3 tables)

Key Result

Theorem 4.1

Fix an iteration $t$ with current model $\hat{\theta}_t$. Let $\mathcal{Q}$ denote the question space and let $\Delta(\mathcal{A})$ be the set of probability measures on the answer space $\mathcal{A}$. Let $\Pi \subset (\mathcal{Q} \to \Delta(\mathcal{A}))$ be a finite model class, and suppose that [...] where $\alpha^{(m)}(\hat{\theta}_t,q) := 1-(1-\alpha(\hat{\theta}_t,q))^m$, $Z^{(m)}_{p_0}(\hat{\the
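The quantity $\alpha^{(m)}$ defined above has the form of an at-least-one-success probability: if a single sampled answer is accepted with probability $\alpha$ and the $m$ samples are independent, then at least one is accepted with probability $1-(1-\alpha)^m$. The sketch below checks that closed form against a Monte-Carlo estimate. Reading $\alpha$ as a per-sample acceptance probability, the independence of samples, and the function names `alpha_m` and `alpha_m_monte_carlo` are assumptions made for illustration; they are not taken from the theorem statement, which is truncated here.

```python
import random


def alpha_m(alpha, m):
    """Closed form 1 - (1 - alpha)^m from the definition of alpha^(m)."""
    return 1.0 - (1.0 - alpha) ** m


def alpha_m_monte_carlo(alpha, m, trials=100_000, seed=0):
    """Estimate the chance that at least one of m independent draws succeeds.

    Assumption: each draw succeeds independently with probability `alpha`.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        if any(rng.random() < alpha for _ in range(m)):
            hits += 1
    return hits / trials


for m in (1, 4, 16):
    print(m, round(alpha_m(0.1, m), 4), round(alpha_m_monte_carlo(0.1, m), 4))
```

Under these assumptions, a larger per-question answer budget $m$ raises $\alpha^{(m)}$ toward 1, with diminishing returns once $(1-\alpha)^m$ is already small.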

Figures (4)

  • Figure 1: Feasible initialization region. Panels (a,c) report Monte-Carlo estimates of the length of the initialization interval $V_{p_0}(\hat{\theta}_0)$ for which $\{F^{\circ t}(V_{p_0}(\hat{\theta}_0))\}_{t\ge 0}$ and $\{(H_t\circ\cdots\circ H_0)(V_{p_0}(\hat{\theta}_0))\}_{t\ge 0}$ are both monotonically increasing in $t$, under different $(\beta',\beta,\nu)$ settings. Panels (b,d) show the length of the feasibility interval $\mathcal{I}_{\mathcal{M}}(\beta',\beta,\nu)$ in Corollary \ref{cor:e2h-feasible-init-length}. Panels (a,b): fix $\beta'=0.1$ and vary $(\beta,\nu)$. Panels (c,d): fix $\beta=0.4$ and vary $(\beta',\nu)$.
  • Figure 2: Improvement initialization region. Panels (a)-(c) report Monte-Carlo estimates of the length of the initialization interval $V_{p_0}(\hat{\theta}_0)$ for which $(G\circ H_{L-1}\circ H_{L-2}\circ\cdots\circ H_0)(V_{p_0}(\hat{\theta}_0)) > F^{\circ L}(V_{p_0}(\hat{\theta}_0))$ holds under different $(\beta',\beta,\nu)$ settings. Panel (a): fix $\beta'=0.1$ and vary $(\beta,\nu)$. Panel (b): fix $\beta=0.4$ and vary $(\beta',\nu)$. Panel (c): fix $\Delta=0.1$ and vary $(\beta',\nu)$; the same panel also includes a zoomed-in view for small $\beta'$.
  • Figure 3: Iterative self-improvement. Panel (a) shows the self-improvement trajectories of a fixed $\hat{\theta}_0$ across tasks with different initial Pass@1 accuracies; hollow markers and faded line segments indicate model collapse (Pass@1$=0$ for at least one target distance $l$). Panel (b) shows the performance under different question budgets $n$, with $\hat{\theta}_0$ and the initial Pass@1 fixed. Panel (c) shows the performance under different per-question answer budgets $m$, with $\hat{\theta}_0$ and the initial Pass@1 fixed.
  • Figure 4: Iterative self-improvement with an easy-to-hard curriculum. Panel (a) fixes $\Delta=0.04$ and $\hat{\theta}_0$ and shows, for different initial Pass@1 accuracies, the final Pass@1 gap between easy-to-hard and the baseline (i.e., $V_{p_0}(\hat{\theta}^{\mathrm{E2H}}_L)-V_{p_0}(\hat{\theta}^{\mathrm{B}}_L)$) as a function of the adjacent task difficulty ratio (captured by $\beta'$). Panel (b) fixes $\Delta=0.04$ and $\beta'=0.25$ and shows, for different initial Pass@1 accuracies (spanning $35\%$--$55\%$), how the final Pass@1 gap varies with the question budget $n$. The solid line reports the mean gap across 15 initializations, and the shaded region indicates $\pm 1$ standard error; the gray bars (right axis) show the number of initializations with a positive gap at each $n$.

Theorems & Definitions (41)

  • Theorem 4.1
  • Remark 4.2: On the model class $\Pi$
  • Corollary 4.4
  • Proposition 4.5
  • Theorem 5.2
  • Corollary 5.3
  • Remark 5.4: Feasibility disfavors small budgets and large adjacent difficulty ratios
  • Proposition 5.5
  • Corollary 5.6
  • Definition A.1: $\mathcal{I}(a,\nu)$
  • ...and 31 more