Rethinking Easy-to-Hard: Limits of Curriculum Learning in Post-Training for Deductive Reasoning

Maximilian Mordig, Andreas Opedal, Weiyang Liu, Bernhard Schölkopf

Abstract

Curriculum learning (CL), motivated by the intuition that learning in increasing order of difficulty should ease generalization, is commonly adopted both in pre-training and post-training of large language models (LLMs). The intuition of CL is particularly compelling for compositional reasoning, where complex problems are built from elementary inference rules; however, the actual impact of CL on such tasks remains largely underexplored. We present a systematic empirical study of CL for post-training of LLMs, using synthetic arithmetic and logical benchmarks where difficulty is characterized by reasoning complexity rather than surface-level proxies. Surprisingly, across multiple model families and curriculum schedules, we find no robust advantage in difficulty-based sequencing over standard random sampling in either accuracy or response length. These findings persist across both supervised fine-tuning (SFT) and reinforcement learning (RL) methods. Our study suggests that, in the context of deductive reasoning, the specific ordering of training examples plays a negligible role in achieving compositional generalization, challenging the practical utility of curriculum-based post-training.

Paper Structure

This paper contains 38 sections, 2 equations, 21 figures, 2 tables.

Figures (21)

  • Figure 1: OOD accuracies after the final epoch for GRPO (top) and SFT (bottom) across datasets and models. We observe no consistent significant difference between standard sampling of training data and CL with an easy-to-hard curriculum strategy. Figures \ref{fig:grpo_accuracy_metrics_extra} and \ref{fig:sft_accuracy_metrics_extra} show similar results across other curriculum strategies.
  • Figure 2: Response lengths on OOD data after the final epoch for GRPO and SFT on the KK dataset.
  • Figure 3: Curriculum strategies. Illustration of four of the five curriculum strategies considered in this work. Here, $D = 5$ difficulty levels, $M = 2$ epochs per curriculum phase, and $R = 2$ additional repetitions of the final phase. In each epoch, $n = N/D$ datapoints are sampled from a dataset of total size $N$, with equal weight assigned to each permitted difficulty level. The Standard (uniform sampling) strategy is not shown. A minimal sampling sketch is given after this figure list.
  • Figure 4: Proof tree examples. Proof trees corresponding to the low difficulty examples in Table \ref{tab:problem_examples_bigger}. For LinearDepth (left) and PartWhole (middle), red text highlights the intermediate quantities that can be tracked to solve the problem iteratively. For KK (right), the full tree is shown, with the final solution marked in red. For larger problem instances, branches of the search space can often be pruned early.
  • Figure 5: Input prompt and ground-truth reasoning length as a function of difficulty. The left panel shows the input prompt length (tokenized using the tokenizer from Qwen3-0.6B), while the right panel shows the ground-truth reasoning trace length (reasoning trace plus answer). Input prompt length increases approximately linearly with difficulty. Ground-truth reasoning length also increases linearly with difficulty for MathGAP and PartWhole, and sublinearly (with a steeper slope) for KK. These trends inform the choice of maximum response length used during RL training and evaluation.
  • ...and 16 more figures
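
To make the easy-to-hard scheme described in the Figure 3 caption concrete, below is a minimal Python sketch of such a curriculum schedule under the stated parameters ($D$ difficulty levels, $M$ epochs per phase, $R$ extra repetitions of the final phase, $n = N/D$ samples per epoch split equally across permitted levels). All names here (`make_easy_to_hard_schedule`, `by_level`, etc.) are illustrative assumptions, not the authors' implementation.

```python
import random

def make_easy_to_hard_schedule(dataset, num_levels=5, epochs_per_phase=2,
                               final_phase_repeats=2, seed=0):
    """Sketch of an easy-to-hard curriculum schedule (assumed interface).

    `dataset` is a list of (example, difficulty_level) pairs with levels in
    {1, ..., num_levels}. In phase d, only levels 1..d are permitted; each
    epoch draws n = N / num_levels examples, split equally across the
    permitted levels. The final phase is repeated `final_phase_repeats`
    additional times. Returns one list of examples per epoch.
    """
    rng = random.Random(seed)
    by_level = {lvl: [ex for ex, l in dataset if l == lvl]
                for lvl in range(1, num_levels + 1)}
    n_per_epoch = len(dataset) // num_levels

    # Phase sequence: one phase per difficulty level, then R repeats of the last.
    phases = list(range(1, num_levels + 1)) + [num_levels] * final_phase_repeats

    schedule = []
    for max_level in phases:
        permitted = list(range(1, max_level + 1))
        for _ in range(epochs_per_phase):
            epoch = []
            for lvl in permitted:
                # Equal weight per permitted difficulty level.
                k = n_per_epoch // len(permitted)
                epoch.extend(rng.choices(by_level[lvl], k=k))
            rng.shuffle(epoch)
            schedule.append(epoch)
    return schedule
```

Under these assumptions, the Standard baseline mentioned in the caption would simply sample each epoch uniformly from the full dataset, ignoring difficulty levels.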