Climbing the Ladder of Reasoning: What LLMs Can-and Still Can't-Solve after SFT?

Yiyou Sun, Georgia Zhou, Haoyue Bai, Hao Wang, Dacheng Li, Nouha Dziri, Dawn Song

TL;DR

The paper investigates what supervised fine-tuning (SFT) on reasoning trajectories adds to LLMs' mathematical reasoning, using the AIME24 benchmark. It identifies a ladder-like progression across Easy, Medium, Hard, and Extremely Hard (Exh) levels and systematically analyzes the data, trajectory styles, and scaling needed to move between tiers. Gains from Easy to Medium come quickly with modest SFT data, whereas Hard-level performance scales slowly with data and is limited by exploration stability and subgoal complexity. Exh-level problems remain largely unsolved, signaling fundamental limits of SFT alone. The work offers a roadmap showing where scaling helps, where it saturates, and why solving Exh-level problems may require approaches beyond standard SFT, such as novel reasoning strategies or tool integration.

Abstract

Recent supervised fine-tuning (SFT) approaches have significantly improved language models' performance on mathematical reasoning tasks, even when models are trained at a small scale. However, the specific capabilities enhanced through such fine-tuning remain poorly understood. In this paper, we conduct a detailed analysis of model performance on the AIME24 dataset to understand how reasoning capabilities evolve. We discover a ladder-like structure in problem difficulty, categorize questions into four tiers (Easy, Medium, Hard, and Extremely Hard (Exh)), and identify the specific requirements for advancing between tiers. We find that progression from the Easy to the Medium tier requires adopting an R1 reasoning style with minimal SFT (500-1K instances), while Hard-level questions suffer from frequent model errors at each step of the reasoning chain, with accuracy plateauing at around 65% despite logarithmic scaling. Exh-level questions present a fundamentally different challenge; they require unconventional problem-solving skills that current models uniformly struggle with. Additional findings reveal that carefully curated small-scale datasets offer limited advantage; scaling dataset size proves far more effective. Our analysis provides a clearer roadmap for advancing language model capabilities in mathematical reasoning.

Paper Structure

This paper contains 25 sections, 28 equations, 9 figures, 4 tables.

Figures (9)

  • Figure 1: Climbing the Reasoning Ladder on AIME24. Left: A conceptual illustration of what it takes for a base model to tackle increasingly difficult problems (Easy → Med → Hard → Exh) on the AIME24 benchmark. Model performance improves from the base Qwen2.5-32B-Instruct, to small-scale SFT (1K R1-trajectory SFT), and further to large-scale post-training or tool augmentation. However, the strongest models still fall short of human expert performance at the most challenging level. Right: Averaged accuracy across AIME24 question IDs, sorted by increasing overall difficulty (as determined by the average accuracy across six models: Qwen2.5-32B-Instruct [qwen2.5], S1.1-32B [muennighoff2025s1], LIMO-32B [ye2025limo], Deepseek-R1 [guo2025deepseek], Qwq-32B [qwq32b], and STILL3-32B [chen2025still]). Each model attempts each question 8 times, and accuracy is averaged over the attempts. Colored lines represent the mean performance for each model category.
  • Figure 2: Performance comparison of the base model across various SFT trajectory settings. The analysis includes variations by question categories, training data size, CoT trajectory lengths (short [sh], normal [nm], large [lg]), and trajectory styles (Gemini-style vs. R1-style). The orange dashed line denotes the soft passline ($\sim$90% accuracy) for Med-level question accuracy.
  • Figure 3: Trajectory similarity scores between various models (SFT-ed in different math domains) and Deepseek-R1 when solving Med-level math problems. Similarities were assessed on a scale from 0 (totally different) to 5 (almost identical).
  • Figure 4: Performance scaling of models via SFT on Hard-level reasoning tasks. We use the $*$ symbol to denote public models. Specifically, 114K$^*$ corresponds to Openthinker-32B [openthoughts] and 1M$^*$ corresponds to Openthinker2-32B.
  • Figure 5: The Reasoning Ladder on AIME25. Averaged accuracy across AIME25 question IDs, sorted by increasing overall difficulty (as determined by the average accuracy across six models: Qwen2.5-32B-Instruct [qwen2.5], S1.1-32B [muennighoff2025s1], LIMO-32B [ye2025limo], Deepseek-R1 [guo2025deepseek], Qwq-32B [qwq32b], and Openthinker2-32B [openthoughts]). Each model attempts each question 8 times, and accuracy is averaged over the attempts. Colored lines represent the mean performance for each model category.
  • ...and 4 more figures
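
The difficulty ordering described in the captions of Figures 1 and 5 (per-question accuracy averaged over repeated attempts, then averaged across models and sorted from easiest to hardest) can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the model and question labels, and the toy results are all hypothetical, and the paper's actual tier boundaries are not reproduced here.

```python
from statistics import mean

def question_accuracy(attempts):
    """Fraction of correct attempts for one model on one question."""
    return sum(attempts) / len(attempts)

def rank_by_difficulty(results):
    """Order questions from easiest to hardest.

    `results` maps model name -> question id -> list of booleans,
    one per attempt (the captions mention 8 attempts per question).
    Returns (question_id, accuracy averaged across models) pairs,
    sorted by decreasing accuracy, i.e. increasing difficulty.
    """
    question_ids = sorted(next(iter(results.values())).keys())
    averaged = {
        qid: mean(question_accuracy(results[m][qid]) for m in results)
        for qid in question_ids
    }
    return sorted(averaged.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical toy data: two models, three questions, 8 attempts each.
toy_results = {
    "model_a": {
        "q1": [True] * 8,
        "q2": [True] * 4 + [False] * 4,
        "q3": [False] * 8,
    },
    "model_b": {
        "q1": [True] * 6 + [False] * 2,
        "q2": [True] * 2 + [False] * 6,
        "q3": [False] * 8,
    },
}

ranking = rank_by_difficulty(toy_results)
for qid, acc in ranking:
    print(f"{qid}: {acc:.3f}")
```

Averaging within each model first, and only then across models, gives every model equal weight in the difficulty estimate regardless of how many attempts it made.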