Self-Evolving Curriculum for LLM Reasoning

Xiaoyin Chen, Jiarui Lu, Minsu Kim, Dinghuai Zhang, Jian Tang, Alexandre Piché, Nicolas Gontier, Yoshua Bengio, Ehsan Kamalloo

TL;DR

This work tackles the sensitivity of RL fine-tuning for LLM reasoning to the training curriculum. It introduces Self-Evolving Curriculum (SEC), which casts curriculum selection as a non-stationary multi-armed bandit and uses the absolute advantage from policy gradients as a proxy learning signal, updated with TD(0) while sampling via Boltzmann exploration. Across planning, inductive reasoning, and mathematics, SEC yields consistent improvements, especially on harder and out-of-distribution problems, and it balances multi-task learning when fine-tuning across domains. The results suggest adaptive, automated curricula can significantly enhance RL-based reasoning in LLMs and generalize across model sizes and tasks.

Abstract

Reinforcement learning (RL) has proven effective for fine-tuning large language models (LLMs), significantly enhancing their reasoning abilities in domains such as mathematics and code generation. A crucial factor influencing RL fine-tuning success is the training curriculum: the order in which training problems are presented. While random curricula serve as common baselines, they remain suboptimal; manually designed curricula often rely heavily on heuristics, and online filtering methods can be computationally prohibitive. To address these limitations, we propose Self-Evolving Curriculum (SEC), an automatic curriculum learning method that learns a curriculum policy concurrently with the RL fine-tuning process. Our approach formulates curriculum selection as a non-stationary Multi-Armed Bandit problem, treating each problem category (e.g., difficulty level or problem type) as an individual arm. We leverage the absolute advantage from policy gradient methods as a proxy measure for immediate learning gain. At each training step, the curriculum policy selects categories to maximize this reward signal and is updated using the TD(0) method. Across three distinct reasoning domains: planning, inductive reasoning, and mathematics, our experiments demonstrate that SEC significantly improves models' reasoning capabilities, enabling better generalization to harder, out-of-distribution test problems. Additionally, our approach achieves better skill balance when fine-tuning simultaneously on multiple reasoning domains. These findings highlight SEC as a promising strategy for RL fine-tuning of LLMs.
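The bandit formulation described in the abstract can be written compactly as follows. This is a sketch under stated assumptions rather than the paper's exact notation: the symbols $c$ (a category, i.e. an arm), $B_t^c$ (the samples from category $c$ in the batch at step $t$), $\hat{A}_i$ (the advantage estimate for sample $i$), $\alpha$ (step size), and $\tau$ (Boltzmann temperature) are introduced here for illustration.

```latex
% Assumed notation; see the lead-in for what each symbol stands for.
\begin{align}
  r_t(c)     &= \frac{1}{|B_t^c|} \sum_{i \in B_t^c} \bigl|\hat{A}_i\bigr|
             && \text{(reward: mean absolute advantage)} \\
  Q_{t+1}(c) &= \alpha\, r_t(c) + (1 - \alpha)\, Q_t(c)
             && \text{(TD(0) update for a bandit arm)} \\
  p_t(c)     &= \frac{\exp\bigl(Q_t(c)/\tau\bigr)}{\sum_{c'} \exp\bigl(Q_t(c')/\tau\bigr)}
             && \text{(Boltzmann sampling)}
\end{align}

% Why absolute advantage can track learning gain, under two simplifying
% assumptions: the task reward is binary with success probability p, and
% the advantage is the reward minus its mean, A = R - p.
\begin{equation}
  \mathbb{E}\,|A| = p\,(1 - p) + (1 - p)\,p = 2p\,(1 - p),
  \qquad \text{maximized at } p = \tfrac{1}{2}.
\end{equation}
```

Under those assumptions, the reward is largest for categories the model solves about half the time and vanishes for categories that are always solved or never solved, which is consistent with a curriculum that moves toward harder problems as the model improves.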

Paper Structure

This paper contains 25 sections, 8 equations, 5 figures, 6 tables, 1 algorithm.

Figures (5)

  • Figure 1: Curriculum matters. A deliberately poor (reverse) curriculum severely limits RL fine-tuning performance. Our proposed Self-Evolving Curriculum (SEC) significantly outperforms the standard random curriculum. See the experimental setup section for details.
  • Figure 2: Overview of Self-Evolving Curriculum (SEC). SEC dynamically adjusts the training curriculum according to the model’s current capabilities. During preprocessing, training data is partitioned into distinct categories (indicated by colors), e.g., by difficulty level or problem type. At each RL fine-tuning step: (1) The curriculum policy samples a training batch based on categories' expected learning gains; (2) The LLM policy is updated using the sampled batch and the chosen RL algorithm; (3) Rewards for curriculum categories are computed using advantage values estimated by the RL algorithm; (4) The curriculum policy is updated accordingly, refining future data selection.
  • Figure 3: Average sample difficulty over training steps. SEC adaptively adjusts task difficulty during RL fine-tuning. Blue curves represent the sampled difficulty, smoothed using a moving average, while the green dashed line indicates the mean difficulty of the dataset. Across all benchmarks (columns) and model sizes (top: Qwen2.5-3B, bottom: Qwen2.5-7B), SEC initially selects easier problems and progressively introduces more challenging ones as training proceeds, effectively aligning difficulty with model improvement.
  • Figure 4: Performance comparison when training on multiple tasks. Left: Test accuracy of Qwen2.5-3B on ID and OOD splits. SEC-2D is implemented by defining an arm for each combination of problem type and difficulty level. SEC-2D consistently achieves higher accuracy, showing improved generalization compared to a random curriculum across tasks. Right: Countdown OOD accuracy vs. training steps, smoothed by a moving average. The random curriculum’s performance collapses mid-training, highlighting its inability to effectively balance multiple tasks. In contrast, SEC-2D maintains stable performance throughout training.
  • Figure S1: Distribution of difficulty levels in the MATH training set.
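
The four-step loop in the Figure 2 caption and the SEC-2D arm definition in the Figure 4 caption can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the class name `CurriculumBandit`, the defaults for `alpha` and `tau`, the task labels, the three difficulty levels, and the hard-coded advantage values are all assumptions made for the example. Steps (2) and (3), the LLM policy update and advantage estimation, are replaced by placeholder numbers.

```python
import math
import random
from itertools import product


class CurriculumBandit:
    """Non-stationary bandit over problem categories (one arm per category)."""

    def __init__(self, arms, alpha=0.5, tau=1.0, seed=0):
        self.arms = list(arms)
        self.alpha = alpha  # step size (assumed value)
        self.tau = tau      # Boltzmann temperature (assumed value)
        self.q = {arm: 0.0 for arm in self.arms}  # expected learning gain per arm
        self.rng = random.Random(seed)

    def probabilities(self):
        # Boltzmann distribution over arms; subtracting the max keeps exp() stable.
        q_max = max(self.q.values())
        weights = [math.exp((self.q[a] - q_max) / self.tau) for a in self.arms]
        total = sum(weights)
        return [w / total for w in weights]

    def sample(self, batch_size):
        # Step (1): draw the category of each problem in the next batch.
        return self.rng.choices(self.arms, weights=self.probabilities(), k=batch_size)

    def update(self, advantages_by_arm):
        # Steps (3)-(4): reward = mean |advantage| per arm, then a TD(0) update.
        for arm, advantages in advantages_by_arm.items():
            if not advantages:
                continue  # arm absent from this batch: leave its value unchanged
            reward = sum(abs(a) for a in advantages) / len(advantages)
            self.q[arm] = self.alpha * reward + (1 - self.alpha) * self.q[arm]


# SEC-2D style arms: one per (problem type, difficulty level) combination.
# The task labels and the three difficulty levels are illustrative only.
arms = list(product(["countdown", "math"], [1, 2, 3]))
bandit = CurriculumBandit(arms, alpha=0.5, tau=0.5)

batch = bandit.sample(8)

# Step (2) would update the LLM on `batch`; here the advantages are made-up
# placeholders standing in for what the RL algorithm would estimate.
bandit.update({("countdown", 1): [0.5, -0.5], ("math", 3): [0.0, 0.0]})
probs = dict(zip(bandit.arms, bandit.probabilities()))
```

After this single update, the arm whose samples had nonzero advantages holds the highest value and is sampled most often, while an arm with all-zero advantages (always solved or never solved) gains nothing. Using a constant step size rather than a running average lets older rewards decay, which suits the non-stationary setting where a category's usefulness changes as the model trains.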