A Task-Centric Theory for Iterative Self-Improvement with Easy-to-Hard Curricula

Chenruo Liu; Yijun Dong; Yiqiu Shen; Qi Lei

TL;DR

Adopting a task-centric view by considering reasoning tasks with multiple difficulty levels, this work proves quantifiable conditions on model initialization, task difficulty, and sample budget where easy-to-hard curricula provably achieve better guarantees than training on fixed mixtures of tasks.

Abstract

Iterative self-improvement fine-tunes an autoregressive large language model (LLM) on reward-verified outputs generated by the LLM itself. In contrast to the empirical success of self-improvement, the theoretical foundation of this generative, iterative procedure in a practical, finite-sample setting remains limited. We make progress toward this goal by modeling each round of self-improvement as maximum-likelihood fine-tuning on a reward-filtered distribution and deriving finite-sample guarantees for the expected reward. Our analysis reveals an explicit feedback loop where better models accept more data per iteration, supporting sustained self-improvement while explaining eventual saturation of such improvement. Adopting a task-centric view by considering reasoning tasks with multiple difficulty levels, we further prove quantifiable conditions on model initialization, task difficulty, and sample budget where easy-to-hard curricula provably achieve better guarantees than training on fixed mixtures of tasks. Our analyses are validated via Monte-Carlo simulations and controlled experiments on graph-based reasoning tasks.
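The feedback loop described above (better models accept more data per iteration) can be illustrated with a small simulation. This is a minimal sketch, not the paper's procedure: it assumes each sampled answer is accepted by the reward independently with a fixed probability `alpha`, and the names `collect_accepted`, `n_questions`, and `m_answers` are hypothetical.

```python
import random


def collect_accepted(alpha, n_questions, m_answers, seed=0):
    """Count reward-accepted answers in one toy data-collection round.

    Assumption: every sampled answer is accepted independently with
    probability `alpha` (a stand-in for the current model's accuracy).
    """
    rng = random.Random(seed)
    accepted = 0
    for _ in range(n_questions):
        for _ in range(m_answers):
            # The reward filter keeps only the answers it verifies.
            if rng.random() < alpha:
                accepted += 1
    return accepted


# A weaker and a stronger model under the same question and answer budgets.
weak = collect_accepted(alpha=0.2, n_questions=200, m_answers=4)
strong = collect_accepted(alpha=0.8, n_questions=200, m_answers=4)
print(weak, strong)
```

Under these assumptions the stronger model passes more samples through the filter, so the next round of fine-tuning sees a larger training set for the same budget.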

Paper Structure

This paper contains 29 sections, 19 theorems, 406 equations, 4 figures, and 3 tables.

Introduction
Our contributions.
Related Work
Self-improvement for LLM mathematical reasoning.
Theoretical Understanding of LLM self-improvement.
Self-distillation, self-consuming loops, and model collapse.
Problem Setup and Notation
Problem setup.
Notation.
Iterative Self-Improvement
Single-Step Self-Improvement
Multi-Step Self-Improvement
Iterative Easy-to-Hard Curriculum for Self-Improvement
Easy-to-Hard Curriculum and Baseline
Difficulty levels.
...and 14 more sections

Key Result

Theorem 4.1

Fix an iteration $t$ with current model $\hat{\theta}_t$. Let $\mathcal{Q}$ denote the question space and let $\Delta(\mathcal{A})$ be the set of probability measures on the answer space $\mathcal{A}$. Let $\Pi \subset (\mathcal{Q} \to \Delta(\mathcal{A}))$ be a finite model class, and suppose that … where $\alpha^{(m)}(\hat{\theta}_t,q) := 1-(1-\alpha(\hat{\theta}_t,q))^m$, $Z^{(m)}_{p_0}(\hat{\the…
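The quantity $\alpha^{(m)}(\hat{\theta}_t,q) = 1-(1-\alpha(\hat{\theta}_t,q))^m$ in the statement above can be evaluated directly. The sketch below assumes $\alpha$ is a per-sample acceptance probability for question $q$, in which case $\alpha^{(m)}$ is the probability that at least one of $m$ independent samples is accepted; that reading and the function name `pass_at_m` are assumptions, not taken from the excerpt.

```python
def pass_at_m(alpha, m):
    """Evaluate 1 - (1 - alpha)^m for acceptance probability alpha and budget m."""
    return 1.0 - (1.0 - alpha) ** m


# Worked values for alpha = 0.5: each extra sample halves the failure probability.
for m in (1, 2, 3):
    print(m, pass_at_m(0.5, m))
```

For $\alpha = 0.5$ this gives $0.5$, $0.75$, and $0.875$ for $m = 1, 2, 3$, showing the diminishing return from a larger per-question answer budget.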

Figures (4)

Figure 1: Feasible initialization region. Panels (a,c) report Monte-Carlo estimates of the length of the initialization interval $V_{p_0}(\hat{\theta}_0)$ for which $\{F^{\circ t}(V_{p_0}(\hat{\theta}_0))\}_{t\ge 0}$ and $\{(H_t\circ\cdots\circ H_0)(V_{p_0}(\hat{\theta}_0))\}_{t\ge 0}$ are both monotonically increasing in $t$, under different $(\beta',\beta,\nu)$ settings. Panels (b,d) show the length of the feasibility interval $\mathcal{I}_{\mathcal{M}}(\beta',\beta,\nu)$ in Corollary \ref{cor:e2h-feasible-init-length}. Panels (a,b): fix $\beta'=0.1$ and vary $(\beta,\nu)$. Panels (c,d): fix $\beta=0.4$ and vary $(\beta',\nu)$.
Figure 2: Improvement initialization region. Panels (a)-(c) report Monte-Carlo estimates of the length of the initialization interval $V_{p_0}(\hat{\theta}_0)$ for which $(G\circ H_{L-1}\circ H_{L-2}\circ\cdots\circ H_0)(V_{p_0}(\hat{\theta}_0)) > F^{\circ L}(V_{p_0}(\hat{\theta}_0))$ holds under different $(\beta',\beta,\nu)$ settings. Panel (a): fix $\beta'=0.1$ and vary $(\beta,\nu)$. Panel (b): fix $\beta=0.4$ and vary $(\beta',\nu)$. Panel (c): fix $\Delta=0.1$ and vary $(\beta',\nu)$; the same panel also includes a zoomed-in view for small $\beta'$.
Figure 3: Iterative self-improvement. Panel (a) shows the self-improvement trajectories of a fixed $\hat{\theta}_0$ across tasks with different initial Pass@1 accuracies; hollow markers and faded line segments indicate model collapse (Pass@1$=0$ for at least one target distance $l$). Panel (b) shows the performance under different question budgets $n$, with $\hat{\theta}_0$ and the initial Pass@1 fixed. Panel (c) shows the performance under different per-question answer budgets $m$, with $\hat{\theta}_0$ and the initial Pass@1 fixed.
Figure 4: Iterative self-improvement with an easy-to-hard curriculum. Panel (a) fixes $\Delta=0.04$ and $\hat{\theta}_0$, and shows for different initial Pass@1 accuracies, the final Pass@1 gap between easy-to-hard and the baseline (i.e., $V_{p_0}(\hat{\theta}^{\mathrm{E2H}}_L)-V_{p_0}(\hat{\theta}^{\mathrm{B}}_L)$) as a function of the adjacent task difficulty ratio (captured by $\beta'$). Panel (b) fixes $\Delta=0.04$ and $\beta'=0.25$, and shows for different initial Pass@1 accuracies (spanning $35\%$--$55\%$), how the final Pass@1 gap varies with the question budget $n$. The solid line reports the mean gap across 15 initializations and the shaded region indicates $\pm 1$ standard error; the gray bars (right axis) show the number of initializations with a positive gap at each $n$.

Theorems & Definitions (41)

Theorem 4.1
Remark 4.2: On the model class $\Pi$
Corollary 4.4
Proposition 4.5
Theorem 5.2
Corollary 5.3
Remark 5.4: Feasibility disfavors small budgets and large adjacent difficulty ratios
Proposition 5.5
Corollary 5.6
Definition A.1: $\mathcal{I}(a,\nu)$
...and 31 more
