Curriculum Reinforcement Learning from Easy to Hard Tasks Improves LLM Reasoning

Shubham Parashar, Shurui Gui, Xiner Li, Hongyi Ling, Sushil Vemuri, Blake Olson, Eric Li, Yu Zhang, James Caverlee, Dileep Kalathil, Shuiwang Ji

TL;DR

This work addresses the challenge of improving reasoning in language models via reinforcement learning post-training, where sparse rewards hinder learning on hard tasks. It proposes E2H Reasoner, a curriculum RL framework that decomposes tasks into levels of increasing difficulty (trivial/easy/medium) and uses probabilistic schedulers (cosine or Gaussian) to interpolate between the task distributions $d_1$ and $d_K$ across curriculum steps, thereby enhancing generalization. Grounded in Approximate Policy Iteration, it provides convergence guarantees and finite-sample complexity bounds, showing that curriculum-based learning can require fewer total samples than direct learning. Empirically, E2H improves reasoning in small LLMs (1.5B–3B) across multiple domains, with strong gains on hard and out-of-distribution tasks, and a public implementation is available.

Abstract

We aim to improve the reasoning capabilities of language models via reinforcement learning (RL). Recent RL post-trained models like DeepSeek-R1 have demonstrated reasoning abilities on mathematical and coding tasks. However, prior studies suggest that using RL alone to improve reasoning on inherently difficult tasks is less effective. Here, we draw inspiration from curriculum learning and propose to schedule tasks from easy to hard (E2H), allowing LLMs to build reasoning skills gradually. Our method is termed E2H Reasoner. Empirically, we observe that, although easy tasks are important initially, fading them out through appropriate scheduling is essential in preventing overfitting. Theoretically, we establish convergence guarantees for E2H Reasoner within an approximate policy iteration framework. We derive finite-sample complexity bounds and show that when tasks are appropriately decomposed and conditioned, learning through curriculum stages requires fewer total samples than direct learning. Experiments across multiple domains show that E2H Reasoner significantly improves the reasoning ability of small LLMs (1.5B to 3B), which otherwise struggle when trained with vanilla RL alone, highlighting the effectiveness of our method. Our code can be found on https://github.com/divelab/E2H-Reasoning.
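
The cosine scheduling idea mentioned in the TL;DR, interpolating between an initial task distribution $d_1$ and a final one $d_K$, can be illustrated with a minimal sketch. This is not taken from the authors' repository: the function name `cosine_mix`, the linear progress variable, and the example distributions are all assumptions made for illustration, and the paper's exact parameterization may differ.

```python
import math


def cosine_mix(step, total_steps, d_start, d_end):
    """Blend two task distributions with a cosine weight (illustrative only).

    d_start and d_end are lists of probabilities over the same difficulty
    levels (e.g. trivial, easy, medium, hard). The weight on d_start decays
    from 1 to 0 along a half cosine as training progresses.
    """
    if len(d_start) != len(d_end):
        raise ValueError("distributions must cover the same task levels")
    # Assumed progress variable in [0, 1]; guards against total_steps == 1.
    progress = step / max(total_steps - 1, 1)
    alpha = 0.5 * (1.0 + math.cos(math.pi * progress))
    return [alpha * p + (1.0 - alpha) * q for p, q in zip(d_start, d_end)]


# Hypothetical distributions: mostly easy tasks at first, mostly hard at the end.
d_1 = [0.7, 0.2, 0.1]
d_K = [0.1, 0.2, 0.7]
schedule = [cosine_mix(t, 11, d_1, d_K) for t in range(11)]
```

Each entry of `schedule` sums to one, so it could be passed as the `weights` argument of `random.choices` to pick the difficulty level of the next training batch. The weight on easy tasks fades smoothly rather than being cut off, which matches the abstract's point that easy tasks matter early but should be phased out.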

Paper Structure

This paper contains 36 sections, 3 theorems, 65 equations, 7 figures, 13 tables.

Key Result

Theorem 3.1

Let $T$ be the number of API policy updates within each task, and let $\beta > 0$ be a tunable stepsize parameter as specified in chen2022approximate. Under the approximate greedy update and evaluation error assumptions above, the final performance gap $\mathcal{E}_K$ satisfies a bound in which $\eta_k := \| Q^*_k - Q^{\pi_k}_k \|_\infty$ is the per-task Bellman error, $\delta_k$ is the evaluation error, and $\e

Figures (7)

  • Figure 1: (a, b) Reinforcement learning (RL) based post-training is believed to improve accuracy at low $k$ values in pass@$k$ evaluation (guo2025deepseek, yue2025does). We show that E2H Reasoner, a curriculum-based RL (CRL) approach, enables LLMs to solve tasks they previously could not, outperforming base models even at higher $k$. (c) LLaMA 3.2 3B reasoning trace for Countdown (gandhi2024stream) after E2H Reasoner post-training.
  • Figure 2: Task decomposition in the Easy-to-Hard (E2H) Reasoner. E2H first decomposes the overall task into levels of increasing difficulty, namely trivial, easy, and medium, to help the LLM acquire core skills. As training progresses, E2H schedules harder tasks accordingly. See Section \ref{subsec:schedulers} for scheduling details.
  • Figure 3: Illustration of cosine scheduling.
  • Figure 4: Gaussian sampler. (a) The Gaussian sampling process. (b, c, d) Sampling probabilities of the different tasks over the training steps under different Gaussian sampler hyperparameters.
  • Figure 4: Performance of E2H Reasoner on GSM8K and AQuA, where difficulty splits are derived from error rates due to the absence of human labels. Fig. \ref{fig:error_levels} shows these splits, and Table \ref{tab:gaussian_splits} confirms robustness to the number of splits.
  • ...and 2 more figures
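
The Gaussian sampler caption above describes per-task sampling probabilities that change over training steps. A rough sketch of one way this could work follows. It assumes a Gaussian bump over the task index whose center moves linearly from the easiest to the hardest level. The function name `gaussian_task_probs`, the linear center, and the `sigma` default are hypothetical and are not confirmed by the source, which does not give the sampler's formula here.

```python
import math


def gaussian_task_probs(step, total_steps, num_tasks, sigma=0.75):
    """Sampling probabilities over task levels 0..num_tasks-1 (illustrative).

    Assumption: the Gaussian center slides linearly from level 0 (easiest)
    to level num_tasks-1 (hardest); sigma controls how many neighboring
    levels receive non-negligible probability at any step.
    """
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    progress = step / max(total_steps - 1, 1)
    center = progress * (num_tasks - 1)
    # Unnormalized Gaussian weight for each difficulty level.
    weights = [
        math.exp(-((k - center) ** 2) / (2.0 * sigma ** 2))
        for k in range(num_tasks)
    ]
    total = sum(weights)
    return [w / total for w in weights]


# Four hypothetical levels (trivial, easy, medium, hard) over 11 steps.
probs_by_step = [gaussian_task_probs(t, 11, 4) for t in range(11)]
```

Under these assumptions, a smaller `sigma` concentrates sampling on one level at a time, while a larger value keeps easier levels in the mix for longer, which is the kind of hyperparameter effect panels (b) to (d) of the caption refer to.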

Theorems & Definitions (6)

  • Theorem 3.1: CRL Performance Guarantee
  • Theorem 3.2: Sample Complexity
  • proof
  • Theorem A.1: Finite-Sample Guarantee
  • proof
  • proof