LADDER: Self-Improving LLMs Through Recursive Problem Decomposition

Toby Simonds, Akira Yoshiyama

TL;DR

LADDER lets LLMs improve autonomously by recursively generating and solving progressively easier variants of hard problems, guided by a numerical verifier. The framework is extended with Test-Time Reinforcement Learning (TTRL), which performs reinforcement learning on variants of each test problem at inference time and yields a state-of-the-art score on the MIT Integration Bee qualifying examination for a 7B model. Results on Llama 3.2 3B and Qwen2.5 7B DeepSeek-R1 Distilled show large gains (for example, from 1% to 82% on undergraduate-level integration problems for Llama 3.2 3B) without architectural scaling or human supervision. The work suggests a general, verification-driven path to model improvement that may apply to other verifiable domains beyond mathematical reasoning.

Abstract

We introduce LADDER (Learning through Autonomous Difficulty-Driven Example Recursion), a framework that enables Large Language Models to autonomously improve their problem-solving capabilities through self-guided learning by recursively generating and solving progressively simpler variants of complex problems. Unlike prior approaches that require curated datasets or human feedback, LADDER leverages a model's own capabilities to generate easier question variants. We demonstrate LADDER's effectiveness on mathematical integration, improving Llama 3.2 3B's accuracy from 1% to 82% on undergraduate-level problems and enabling Qwen2.5 7B DeepSeek-R1 Distilled to achieve 73% on the MIT Integration Bee qualifying examination. We also introduce TTRL (Test-Time Reinforcement Learning), where we perform reinforcement learning on variants of test problems at inference time. TTRL enables Qwen2.5 7B DeepSeek-R1 Distilled to achieve a state-of-the-art score of 90% on the MIT Integration Bee qualifying examination, surpassing OpenAI o1's performance. These results show how self-directed strategic learning can achieve significant capability improvements without relying on architectural scaling or human supervision.
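
The summary above mentions a numerical verifier but does not describe it. As a minimal sketch, assuming the verifier checks a candidate antiderivative by comparing it against numerical quadrature of the integrand on a few random intervals, the idea could look like the following. The function names (`simpson`, `verify_antiderivative`), the sampling range, and the tolerances are illustrative assumptions, not the authors' implementation.

```python
import math
import random
from typing import Callable


def simpson(f: Callable[[float], float], a: float, b: float, n: int = 200) -> float:
    """Composite Simpson's rule on [a, b] using n (even) subintervals."""
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        # Odd interior points get weight 4, even interior points weight 2.
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3


def verify_antiderivative(
    integrand: Callable[[float], float],
    candidate: Callable[[float], float],
    trials: int = 5,
    lo: float = -1.0,
    hi: float = 1.0,
    rel_tol: float = 1e-4,
    seed: int = 0,
) -> bool:
    """Accept `candidate` only if F(b) - F(a) matches the numerical
    integral of `integrand` on every sampled interval (assumed scheme)."""
    rng = random.Random(seed)
    for _ in range(trials):
        a = rng.uniform(lo, hi)
        b = rng.uniform(lo, hi)
        expected = simpson(integrand, a, b)
        actual = candidate(b) - candidate(a)
        if not math.isclose(expected, actual, rel_tol=rel_tol, abs_tol=1e-6):
            return False
    return True


# Hypothetical example: the integral of x*cos(x) is x*sin(x) + cos(x) + C.
correct = verify_antiderivative(
    lambda x: x * math.cos(x),
    lambda x: x * math.sin(x) + math.cos(x),
)
# A wrong answer that drops the cos(x) term should be rejected.
wrong = verify_antiderivative(
    lambda x: x * math.cos(x),
    lambda x: x * math.sin(x),
)
```

A check of this kind needs no reference solution, which is what would make it usable as an automatic reward for model-generated variants. A real verifier would also have to handle singularities and parse the model's symbolic output, which this sketch leaves out.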

Paper Structure

This paper contains 25 sections, 5 equations, 7 figures, 2 algorithms.

Figures (7)

  • Figure 1: Example of variant generation for an integration problem. Each level represents progressively simpler variants of the original problem. The tree structure ensures each variant has exactly one parent, maintaining a clear progression of difficulty.
  • Figure 2: LADDER training progression.
  • Figure 3: Performance scaling w/ variant generation.
  • Figure 4: Comparison of LLM scores by training approach.
  • Figure 5: Comparison of LLM scores on the 2025 MIT Integration Bee (7B). Average across 3 runs.
  • ...and 2 more figures
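
The caption of Figure 1 describes a variant tree in which each variant has exactly one parent and each level is simpler than the one above it. A minimal sketch of such a structure is shown below; the class name `Variant`, the helper `levels`, and the example integrals are hypothetical and are not taken from the figure.

```python
from __future__ import annotations

from collections import deque
from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False)  # identity-based equality avoids recursing through parent links
class Variant:
    problem: str
    parent: Optional[Variant] = field(default=None, repr=False)
    children: list[Variant] = field(default_factory=list)

    def add_child(self, problem: str) -> Variant:
        """Attach a simpler variant; the child records this node as its only parent."""
        child = Variant(problem, parent=self)
        self.children.append(child)
        return child

    def depth(self) -> int:
        """Number of edges from this variant up to the original problem."""
        d, node = 0, self
        while node.parent is not None:
            d, node = d + 1, node.parent
        return d


def levels(root: Variant) -> list[list[Variant]]:
    """Group variants by depth with a breadth-first walk (level 0 = original problem)."""
    out: list[list[Variant]] = []
    queue = deque([root])
    while queue:
        out.append(list(queue))
        queue = deque(c for node in queue for c in node.children)
    return out


# Hypothetical difficulty chain: each child is an easier integral than its parent.
root = Variant("integrate x^2 * e^x dx")
mid = root.add_child("integrate x * e^x dx")
leaf = mid.add_child("integrate e^x dx")
by_level = [[v.problem for v in lvl] for lvl in levels(root)]
```

Storing a single `parent` reference per node is what enforces the one-parent property, and grouping by depth gives a natural easiest-to-hardest ordering if one wanted to train from the leaves upward.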