Provable Benefit of Curriculum in Transformer Tree-Reasoning Post-Training
Dake Bu, Wei Huang, Andi Han, Atsushi Nitanda, Hau-San Wong, Qingfu Zhang, Taiji Suzuki
TL;DR
This work develops theoretical foundations for curriculum post-training of transformer-based reasoning. It models reasoning as a 2S-ART (two-state conditioned autoregressive reasoning tree) and represents a strong base model by PART (a uniform-branching base). It proves that, under mild base-model coverage and stepwise difficulty alignment, curriculum strategies reduce learning costs from exponential in the reasoning depth to polynomial, both in RL finetuning and in test-time scaling. The analysis gives explicit complexity bounds for depth-increasing and hint-decreasing curricula, and shows that transformers can realize the PART base behavior. These results offer a principled explanation for the gains observed from curriculum-style post-training in CoT reasoning, with practical implications for post-training protocols and inference efficiency, and they outline avenues for extending the theory to broader task classes and model families.
Abstract
Recent curriculum techniques in the post-training stage of LLMs have been widely observed to outperform non-curriculum approaches in enhancing reasoning performance, yet a principled understanding of why and to what extent they work remains elusive. To address this gap, we develop a theoretical framework grounded in the intuition that progressively learning through manageable steps is more efficient than directly tackling a hard reasoning task, provided each stage stays within the model's effective competence. Under mild complexity conditions linking consecutive curriculum stages, we show that curriculum post-training avoids the exponential complexity bottleneck. To substantiate this result, drawing on Chain-of-Thought (CoT) solutions to mathematical problems such as Countdown and parity, we model CoT generation as a states-conditioned autoregressive reasoning tree, define a uniform-branching base model to capture pretrained behavior, and formalize curriculum stages as either depth-increasing (longer reasoning chains) or hint-decreasing (shorter prefixes) subtasks. Our analysis shows that, under outcome-only reward signals, reinforcement learning finetuning with a curriculum achieves high accuracy with polynomial sample complexity, whereas direct learning suffers from an exponential bottleneck. We further establish analogous guarantees for test-time scaling, where curriculum-aware querying reduces both reward oracle calls and sampling cost from exponential to polynomial order.
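The exponential-versus-polynomial gap described above can be illustrated with a deliberately simplified toy. The sketch below is not the paper's 2S-ART or PART construction: it assumes a reasoning tree with a single correct path, a base sampler that picks each step uniformly among `branching` children, and an outcome-only reward that only checks a whole chain against the current subtask. All function names are hypothetical, and "learning" a stage is reduced to memorizing the rewarded prefix, which stands in for RL finetuning.

```python
import random


def outcome_reward(chain, goal):
    """Outcome-only reward: 1 if the whole chain matches the goal, else 0."""
    return int(chain == goal)


def direct_expected_calls(branching, depth):
    """Expected reward calls when a uniform base samples full-depth chains.

    Each rollout hits the single correct leaf with probability
    branching ** -depth, so the expected number of rollouts is its inverse.
    """
    return branching ** depth


def curriculum_calls(target, branching, rng):
    """Depth-increasing curriculum on the toy tree.

    Stage k asks for the correct chain of length k. The prefix learned in
    stage k - 1 is reused, so only the newest step is sampled uniformly.
    Returns the learned chain and the number of reward calls spent.
    """
    learned = []
    calls = 0
    for k in range(1, len(target) + 1):
        goal = target[:k]  # subtask for this stage (assumed to be available)
        while True:
            chain = learned + [rng.randrange(branching)]
            calls += 1
            if outcome_reward(chain, goal):
                learned = chain  # keep the rewarded prefix for the next stage
                break
    return learned, calls


rng = random.Random(0)
branching, depth = 4, 10
target = [rng.randrange(branching) for _ in range(depth)]

learned, calls = curriculum_calls(target, branching, rng)
print("direct (expected reward calls):", direct_expected_calls(branching, depth))
print("curriculum (observed reward calls):", calls)
```

In this toy, each curriculum stage is a geometric trial with success probability `1 / branching`, so the expected total is `branching * depth` reward calls (40 for the values above), compared with `branching ** depth` (1,048,576) for direct sampling. The paper's conditions on base-model coverage and difficulty alignment between stages play the role that the shared prefix plays here.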
