Self-Evolving Curriculum for LLM Reasoning

Xiaoyin Chen; Jiarui Lu; Minsu Kim; Dinghuai Zhang; Jian Tang; Alexandre Piché; Nicolas Gontier; Yoshua Bengio; Ehsan Kamalloo

TL;DR

This work addresses the sensitivity of RL fine-tuning for LLM reasoning to the training curriculum. It introduces Self-Evolving Curriculum (SEC), which casts curriculum selection as a non-stationary multi-armed bandit, uses the absolute advantage from policy gradient methods as a proxy for learning gain, updates the curriculum policy with TD(0), and samples problem categories via Boltzmann exploration. Across planning, inductive reasoning, and mathematics, SEC yields consistent improvements, especially on harder and out-of-distribution problems, and it balances skills when fine-tuning across multiple domains. The results suggest that adaptive, automated curricula can significantly enhance RL-based reasoning in LLMs and generalize across model sizes and tasks.

Abstract

Reinforcement learning (RL) has proven effective for fine-tuning large language models (LLMs), significantly enhancing their reasoning abilities in domains such as mathematics and code generation. A crucial factor influencing RL fine-tuning success is the training curriculum: the order in which training problems are presented. Random curricula are the common baseline but remain suboptimal, manually designed curricula often rely heavily on heuristics, and online filtering methods can be computationally prohibitive. To address these limitations, we propose Self-Evolving Curriculum (SEC), an automatic curriculum learning method that learns a curriculum policy concurrently with the RL fine-tuning process. Our approach formulates curriculum selection as a non-stationary Multi-Armed Bandit problem, treating each problem category (e.g., difficulty level or problem type) as an individual arm. We leverage the absolute advantage from policy gradient methods as a proxy measure for immediate learning gain. At each training step, the curriculum policy selects categories to maximize this reward signal and is updated using the TD(0) method. Across three distinct reasoning domains (planning, inductive reasoning, and mathematics), our experiments demonstrate that SEC significantly improves models' reasoning capabilities, enabling better generalization to harder, out-of-distribution test problems. Additionally, our approach achieves better skill balance when fine-tuning simultaneously on multiple reasoning domains. These findings highlight SEC as a promising strategy for RL fine-tuning of LLMs.
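The curriculum loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the class name `CurriculumBandit`, the category labels, the step size `alpha`, and the `temperature` value are all assumptions made for the example. It also assumes that the reward for a category is the mean absolute advantage over its rollouts and that sampling follows a softmax (Boltzmann) distribution over the per-category value estimates.

```python
import math
import random


class CurriculumBandit:
    """Hypothetical non-stationary bandit over problem categories."""

    def __init__(self, categories, alpha=0.5, temperature=1.0, seed=0):
        self.q = {c: 0.0 for c in categories}  # value estimate per arm
        self.alpha = alpha                      # TD(0) step size (assumed value)
        self.temperature = temperature          # Boltzmann temperature (assumed value)
        self.rng = random.Random(seed)

    def probabilities(self):
        # Softmax over value estimates, shifted by the max for numerical stability.
        top = max(self.q.values())
        weights = {c: math.exp((v - top) / self.temperature)
                   for c, v in self.q.items()}
        total = sum(weights.values())
        return {c: w / total for c, w in weights.items()}

    def sample(self, k=1):
        # Draw k categories (with replacement) for the next training batch.
        probs = self.probabilities()
        cats = list(probs)
        return self.rng.choices(cats, weights=[probs[c] for c in cats], k=k)

    def update(self, category, advantages):
        # Reward proxy (assumption): mean absolute advantage of the rollouts.
        reward = sum(abs(a) for a in advantages) / len(advantages)
        # TD(0) for a stateless bandit: move the estimate toward the reward.
        self.q[category] += self.alpha * (reward - self.q[category])
        return reward


# Usage: advantages would come from the RL trainer; these are made-up numbers.
bandit = CurriculumBandit(["easy", "medium", "hard"])
bandit.update("easy", [0.0, 0.0, 0.0, 0.0])
bandit.update("medium", [0.5, -0.5, 0.5, -0.5])
next_batch_categories = bandit.sample(k=8)
```

Because the update uses a constant step size rather than a running average, older rewards decay geometrically, which is one simple way to track a reward signal that shifts as the model improves.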
