From Correction to Mastery: Reinforced Distillation of Large Language Model Agents

Yuanjie Lyu, Chengyu Wang, Jun Huang, Tong Xu

TL;DR

SCoRe introduces a student-centered distillation paradigm for LLM agents, where the student generates trajectories and a teacher intervenes only at the earliest error. This enables capability-matched data and deficiency localization, reducing error propagation from $O(H^2)$ to $O(H)$ and fostering genuine problem-solving beyond imitation. The approach combines an initial Code-as-Action distillation with Mentored Problem-Solving and a short-horizon RL phase using key-step rewards, yielding strong gains across math, factual reasoning, and deep-search tasks. Empirically, a 7B-parameter SCoRe student can match or closely approach a 72B teacher on 12 benchmarks, while offering substantial cost and latency advantages.

Abstract

Large Language Model agents excel at solving complex tasks through iterative reasoning and tool use, but typically depend on ultra-large, costly backbones. Existing distillation approaches train smaller students to imitate full teacher trajectories, yet reasoning and knowledge gaps between the teacher and student can cause compounding errors. We propose SCoRe, a student-centered framework in which the student generates training trajectories and the teacher corrects only the earliest error, producing training data matched to the student's ability and exposing specific weaknesses. The student is first fine-tuned on corrected trajectories. Subsequently, short-horizon reinforcement learning starts from the verified prefix preceding the earliest error, with target rewards assigned at that step. This design encourages autonomous problem-solving beyond imitation and enhances training stability. On 12 challenging benchmarks, a 7B-parameter student distilled with SCoRe matches the agentic performance of a 72B-parameter teacher.
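The data-collection loop described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `mentored_rollout`, `student_step`, `teacher_fix`, and `MentoredTrajectory` are hypothetical, steps are plain strings, and the teacher is assumed to intervene once, at the earliest error.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class MentoredTrajectory:
    steps: List[str]                  # final trajectory, including the teacher's fix
    first_error_index: Optional[int]  # position of the earliest error, if any

    @property
    def verified_prefix(self) -> List[str]:
        # Steps preceding the earliest error; a short-horizon RL rollout
        # would start from here (assumption based on the abstract).
        if self.first_error_index is None:
            return list(self.steps)
        return self.steps[: self.first_error_index]


def mentored_rollout(
    student_step: Callable[[List[str]], Optional[str]],
    teacher_fix: Callable[[List[str], str], Optional[str]],
    max_steps: int = 16,
) -> MentoredTrajectory:
    steps: List[str] = []
    first_error_index: Optional[int] = None
    for _ in range(max_steps):
        proposal = student_step(steps)
        if proposal is None:  # student signals it is done
            break
        if first_error_index is None:
            # Teacher only checks until the first error has been found.
            fix = teacher_fix(steps, proposal)
            if fix is not None:
                first_error_index = len(steps)
                proposal = fix  # replace only the erroneous step
        steps.append(proposal)
    return MentoredTrajectory(steps, first_error_index)


# Toy example: the student goes wrong at its third step.
TARGET = ["plan", "search", "compute", "answer"]


def toy_student(history: List[str]) -> Optional[str]:
    if len(history) >= len(TARGET):
        return None
    return "WRONG" if len(history) == 2 else TARGET[len(history)]


def toy_teacher(history: List[str], proposal: str) -> Optional[str]:
    expected = TARGET[len(history)]
    return expected if proposal != expected else None


traj = mentored_rollout(toy_student, toy_teacher)
```

In this toy run the student's own first two steps are kept, the teacher replaces only the third, and the student finishes the task itself. The corrected trajectory would serve as fine-tuning data, and `traj.verified_prefix` marks where a shortened RL rollout could begin.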

Paper Structure

This paper contains 16 sections, 6 theorems, 37 equations, 7 figures, 6 tables.

Key Result

Theorem 3.1

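If a student $\hat{\pi}$ trained on teacher $\pi_E$ demonstrations via behavior cloning (BC) satisfies a per-step imitation-error condition, then its error compounds over the horizon $H$ as $O(H^2)$.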

Figures (7)

  • Figure 1: Comparison between imitation-based distillation and our SCoRe framework. (a) Prior methods clone entire teacher trajectories. (b) Our approach lets the student explore, with the teacher correcting only the earliest error. Correction-based SFT mitigates the compounding errors of pure imitation. RL rollouts then start from this verified prefix, improving stability and efficiency.
  • Figure 2: The SCoRe framework. (a) A student agent attempts a task, and the teacher provides a single-step correction at the first error, creating student-centric training data. (b) The student is initially trained to imitate full solution trajectories via supervised fine-tuning. (c) The student is further improved through reinforcement learning, using shortened rollouts starting from the prefix preceding the teacher's correction, and targeted rewards at the corrected steps to guide exploration.
  • Figure 3: Teacher intervention frequency and performance on "hard data" after training. Categories: 0 = solved by the student alone, 1 = one teacher correction, $\geq$2 = two or more corrections. Hard data = samples that remain unsolved even with the teacher's help.
  • Figure 4: Performance of models trained with SFT on MPS-generated data (data scales: 10K, 5K, 2K), compared to an RL-trained model. For math tasks, performance is measured as agreement between generated and reference answers, assessed using Qwen2.5-72B-Instruct; QA tasks are evaluated using the F1 score for answer similarity. The evaluation protocol matches that used in the main paper.
  • Figure 5: Performance comparison of SCoRe-SFT, SCoRe-RL, and a DPO baseline. While DPO uses the same MPS-generated data as SCoRe-RL in a preference-learning formulation, it yields only marginal gains over SCoRe-SFT. The evaluation protocol matches that used in the main paper.
  • ...and 2 more figures
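The Figure 4 caption mentions an F1 score for answer similarity on QA tasks. A common way to compute such a score is token-level F1 between the predicted and reference answers, sketched below. The paper's exact tokenization and normalization are not given here, so the lowercasing and whitespace splitting are assumptions, and `token_f1` is a hypothetical helper name.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between two answers (assumed normalization:
    lowercase and whitespace tokenization)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

For example, `token_f1("Paris France", "paris")` has precision 1/2 and recall 1, giving an F1 of 2/3, so a verbose but correct answer is penalized less harshly than under exact match.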

Theorems & Definitions (9)

  • Theorem 3.1: BC compounding-error bound
  • Theorem 3.2: SCoRe first-error-correction bound
  • Theorem 3.3: Variance bound for shortened rollout
  • Theorem A.1: BC compounding-error bound
  • proof
  • Theorem A.2: SCoRe first-error-correction bound
  • proof
  • Theorem A.3: Variance bound for shortened rollout
  • proof