Provable Benefit of Curriculum in Transformer Tree-Reasoning Post-Training

Dake Bu, Wei Huang, Andi Han, Atsushi Nitanda, Hau-San Wong, Qingfu Zhang, Taiji Suzuki

TL;DR

This work studies the theoretical foundations of curriculum post-training for transformer-based reasoning. It models reasoning as a 2S-ART ($2$-states conditioned autoregressive reasoning tree) and the pretrained base model as PART (a probabilistic 2S-ART with uniform branching). It proves that, under mild base-model coverage and stepwise difficulty alignment, curriculum strategies reduce learning costs that are exponential in depth to polynomial ones, in both RL finetuning and test-time scaling. The analysis gives explicit complexity bounds for depth-increasing and hint-decreasing curricula and shows that transformers can realize the PART base behavior. These results offer a principled explanation for the gains observed from curriculum-style post-training in CoT reasoning, with practical implications for post-training protocols and inference efficiency, and they outline avenues for extending the theory to broader task classes and model families.

Abstract

Recent curriculum techniques in the post-training stage of LLMs have been widely observed to outperform non-curriculum approaches in enhancing reasoning performance, yet a principled understanding of why and to what extent they work remains elusive. To address this gap, we develop a theoretical framework grounded in the intuition that progressively learning through manageable steps is more efficient than directly tackling a hard reasoning task, provided each stage stays within the model's effective competence. Under mild complexity conditions linking consecutive curriculum stages, we show that curriculum post-training avoids the exponential complexity bottleneck. To substantiate this result, drawing insights from Chain-of-Thought (CoT) solutions to mathematical problems such as Countdown and parity, we model CoT generation as a states-conditioned autoregressive reasoning tree, define a uniform-branching base model to capture pretrained behavior, and formalize curriculum stages as either depth-increasing (longer reasoning chains) or hint-decreasing (shorter prefixes) subtasks. Our analysis shows that, under outcome-only reward signals, reinforcement learning finetuning achieves high accuracy with polynomial sample complexity, whereas direct learning suffers from an exponential bottleneck. We further establish analogous guarantees for test-time scaling, where curriculum-aware querying reduces both reward oracle calls and sampling cost from exponential to polynomial order.
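The exponential-versus-polynomial contrast can be illustrated with a deliberately simplified toy model. This is not the paper's bound: it assumes a single correct path in a tree with constant branching factor `branching`, a base model that picks children uniformly, and a depth-increasing curriculum in which each stage only has to discover one new step because the earlier prefix is already learned. All function names below are hypothetical.

```python
import random


def direct_expected_samples(branching: int, depth: int) -> int:
    """Expected uniform rollouts until the one correct depth-`depth` path is drawn.

    Each rollout succeeds with probability branching**(-depth), so the
    geometric mean waiting time is branching**depth (exponential in depth).
    """
    return branching ** depth


def curriculum_expected_samples(branching: int, depth: int) -> int:
    """Expected rollouts when each of `depth` stages must find only one new step.

    Each stage is a geometric wait with success probability 1/branching,
    so the total is branching * depth (linear in depth).
    """
    return branching * depth


def simulate_direct(branching: int, depth: int, rng: random.Random) -> int:
    """Count uniform rollouts until every step matches the target path (child 0)."""
    trials = 0
    while True:
        trials += 1
        if all(rng.randrange(branching) == 0 for _ in range(depth)):
            return trials


def simulate_curriculum(branching: int, depth: int, rng: random.Random) -> int:
    """Count rollouts when the learned prefix is replayed and one step is explored per stage."""
    trials = 0
    for _ in range(depth):  # one curriculum stage per additional reasoning step
        while True:
            trials += 1
            if rng.randrange(branching) == 0:  # outcome reward signals the new step is correct
                break
    return trials
```

For example, with `branching=3` and `depth=8` the two closed forms give 6561 and 24 expected rollouts, respectively. The simulators can be run with a seeded `random.Random` to check that empirical averages track these values.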

Paper Structure

This paper contains 21 sections, 22 theorems, 244 equations, 2 figures, 6 algorithms.

Key Result

Theorem 1

Consider a curriculum of $K$ tasks $\pi_0^\star,\pi_1^\star,\ldots,\pi_K^\star$ with base model $\pi_{\mathrm{ref}}:=\pi^{\star}_{0}$ and target task $\pi^\star:=\pi_K^\star$. Denote the learning complexity (footnote: the concrete complexity is left to be defined for each specific scenario; for instance, it …) … for some constant $C^{\star}>1$, where $\widetilde{\Theta}(\cdot)$ hides logarithmic factors in a co…

Figures (2)

  • Figure 1: An illustration from liu2025UFT of the Chain-of-Thought for the Countdown game, where the goal is to reach $24$ by applying basic arithmetic operations $(+, -, \times, \div)$ ($\Phi_l(\cdot,\cdot)$ in our Def. 1) between the current step's number (e.g., $13$, $65$, or $72$) and some unused number (e.g., $5$, $7$, or $3$) in $\{3,5,7,13\}$. Per parashar2025curriculum, the difficulty measure of Countdown is the number of arithmetic operations required to solve an instance.
  • Figure 2: Reasoning tree for parity problems with $d=K=3$ and input $x_1, x_2, x_3, \mathrm{EOS}$. The nodes on the 2nd--4th levels denote hypotheses about which index is the current secret index, corresponding to $x_{i_1}$, $x_{i_2}$, and $x_{i_3}$, respectively. In our parity CoT class $f\in\mathcal{F}_{\text{$2$S-ART}}^{\text{parity}}$, each step actually consists of two actions: (i) choose the next secret index $i_t$; (ii) after the choice, apply an XOR over $z_{t-1}$ and $x_{i_t}$ as $z_{t}=\Phi_{l}(z_{t-1},x_{i_t}):=z_{t-1}\oplus x_{i_t}$, as formalized in Eq. (\ref{eq:parity_CoT}). For visual clarity, the tree only displays the index-selection branches and omits the explicit XOR updates. $\mathrm{E}$ denotes the $\mathrm{EOS}$ token. For parity tasks, there are illegal children $\notin\mathcal{I}_l$ for each parent that violate the "legal" criteria: non-repeating indices and strictly increasing order ($i_1 < i_2 < i_3 < \cdots$); thus, for any parity problem, CoTs with duplicate variables or decreasing index order are illegal.
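
Both captions describe a chain of state updates $z_t=\Phi_l(z_{t-1},x)$, which can be sketched in a few lines of Python. This is a minimal illustration, not the paper's formal construction. For Countdown, the operators $\times, +, \div$ are inferred here from the numbers in the Figure 1 caption ($13\times 5=65$, $65+7=72$, $72\div 3=24$). For parity, the initial state $z_0=0$ and the 0-based indexing are assumptions, and all function names are hypothetical.

```python
from itertools import combinations
from operator import add, mul, truediv


def run_chain(start, steps):
    """Apply z_t = Phi(z_{t-1}, x) for each (Phi, x) pair and return every state."""
    states = [start]
    for phi, x in steps:
        states.append(phi(states[-1], x))
    return states


# Countdown chain matching the numbers in Figure 1 (operators inferred, see lead-in).
countdown_states = run_chain(13, [(mul, 5), (add, 7), (truediv, 3)])


def is_legal(indices):
    """A parity CoT is legal iff its indices are strictly increasing (hence non-repeating)."""
    return all(a < b for a, b in zip(indices, indices[1:]))


def parity_chain(x, indices):
    """Return the XOR states z_t = z_{t-1} ^ x[i_t], starting from the assumed z_0 = 0."""
    if not is_legal(indices):
        raise ValueError("illegal CoT: indices must be strictly increasing")
    z, states = 0, []
    for i in indices:
        z ^= x[i]
        states.append(z)
    return states


def legal_chains(d, k):
    """Enumerate all legal index sequences of length k over d input bits (0-based)."""
    return list(combinations(range(d), k))
```

With $d=K=3$ as in Figure 2, `legal_chains(3, 3)` contains only the sequence `(0, 1, 2)`; every other leaf of the full index-selection tree repeats an index or breaks the increasing order.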

Theorems & Definitions (30)

  • Theorem 1
  • Definition 1: $2$-States Conditioned Autoregressive Reasoning Tree ($2$S-ART)
  • Definition 2: Probabilistic $2$S-ART Base Model (PART)
  • Corollary 1: Exponential Decay of Success Probability with Depth
  • Theorem 2: Base Model as PART ($\operatorname{TF}(\cdot;\mathbf{W}_{\mathrm{base}})$)
  • Theorem 3: Curriculum RL Finetuning Avoids Exponential Bottleneck
  • Theorem 4: Curriculum Test-time Scaling Avoids Exponential Bottleneck
  • Theorem 5: Curriculum Learning with Spanner Sampling
  • Remark 6: Exponential--Polynomial Gaps Under Relaxed Conditions
  • Remark 7: Relation to E2H (CRL) theory and limits for post-training
  • ...and 20 more