Distilling LLM Agent into Small Models with Retrieval and Code Tools

Minki Kang, Jongwon Jeong, Seanie Lee, Jaewoong Cho, Sung Ju Hwang

TL;DR

This work tackles the cost and scalability challenge of deploying LLMs by introducing Agent Distillation, a framework that transfers agentic problem-solving behavior from large language model agents to small language models equipped with retrieval and code tools. It introduces two key techniques: First-thought Prefix (FTP), which aligns teacher trajectories with instruction-tuned behavior, and Self-consistent Action Generation (SAG), which improves test-time robustness by generating multiple candidate actions and selecting the one with consistent outcomes. Across eight benchmarks spanning factual and mathematical reasoning, distilling agent behavior enables $0.5\mathrm{B}$ to $3\mathrm{B}$ models to perform competitively with next-tier larger CoT-distilled models ($1.5\mathrm{B}$ to $7\mathrm{B}$), with gains amplified by FTP and SAG. The approach demonstrates practical, tool-using small agents capable of adaptive information retrieval and code execution, offering a viable path to efficient on-device or resource-constrained deployments. It also highlights avenues for future improvement in trajectory generation, safety, and broader model generalization.

Abstract

Large language models (LLMs) excel at complex reasoning tasks but remain computationally expensive, limiting their practical deployment. To address this, recent works have focused on distilling reasoning capabilities into smaller language models (sLMs) using chain-of-thought (CoT) traces from teacher LLMs. However, this approach struggles in scenarios requiring rare factual knowledge or precise computation, where sLMs often hallucinate due to limited capability. In this work, we propose Agent Distillation, a framework for transferring not only reasoning capability but full task-solving behavior from LLM-based agents into sLMs with retrieval and code tools. We improve agent distillation along two complementary axes: (1) we introduce a prompting method called first-thought prefix to enhance the quality of teacher-generated trajectories; and (2) we propose self-consistent action generation to improve the test-time robustness of small agents. We evaluate our method on eight reasoning tasks across factual and mathematical domains, covering both in-domain and out-of-domain generalization. Our results show that sLMs with 0.5B, 1.5B, and 3B parameters can achieve performance competitive with next-tier larger 1.5B, 3B, and 7B models fine-tuned using CoT distillation, demonstrating the potential of agent distillation for building practical, tool-using small agents. Our code is available at https://github.com/Nardien/agent-distillation.

Paper Structure

This paper contains 43 sections, 18 equations, 8 figures, 8 tables.

Figures (8)

  • Figure 1: Performance comparison of different sizes of Qwen2.5-Instruct models on the average accuracy of four factual reasoning tasks (HotpotQA, Bamboogle, MuSiQue, 2WikiMultiHopQA) and four mathematical reasoning tasks (MATH, GSM-Hard, AIME, OlymMATH). Distillation is done using the 32B model as the teacher and models ranging from 0.5B to 7B as students. Agent distillation consistently improves the performance of smaller models across both domains by enabling them to execute code and retrieve information adaptively. Full results are provided in the main results table.
  • Figure 2: Concept. Chain-of-Thought (CoT) distillation trains student models to mimic static reasoning traces from LLMs, but often fails when new knowledge or precise computation is needed at test time. Our proposed agent distillation instead teaches student models to think and act (e.g., retrieve facts or execute code), offering stronger generalization and better robustness to hallucination.
  • Figure 3: (a) First-thought Prefix: We prompt the teacher with a CoT prompt to induce step-by-step reasoning. The first reasoning step is used as a prefix to generate an agentic trajectory, which is then distilled to a student agent to teach CoT-style reasoning initialization. (b) Self-consistent Action Generation: The agent generates multiple candidate actions and selects the one with consistent outcomes. Thoughts are omitted for brevity.
  • Figure 4: Performance comparison on the MATH subcategories and levels between CoT and Agent distillation of 3B models. Left: Accuracy by problem category. Right: Accuracy by problem difficulty level. The results highlight that FTP improves the performance of small agents on harder problems.
  • Figure 5: Comparison of SAG in agents and self-consistency in CoT for 3B models: self-consistency in CoT is helpful in math tasks but not in factual tasks.
  • ...and 3 more figures
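
The two techniques described in the Figure 3 caption can be illustrated with a minimal Python sketch. This is not the authors' implementation (see their repository for that); it only mirrors the caption's wording. Several things here are assumptions made for illustration: the function names, the prompt templates, treating the first non-empty line of a CoT response as the "first reasoning step", the convention that a code action stores its answer in a variable named `result`, and using a majority vote over executed outcomes as the notion of "consistent outcomes".

```python
from collections import Counter
from typing import Callable, List, Optional


def first_thought(cot_response: str) -> str:
    """Return the first reasoning step of a CoT response.

    Assumption: a "step" is the first non-empty line of the response.
    """
    for line in cot_response.splitlines():
        if line.strip():
            return line.strip()
    return ""


def build_prefixed_prompt(question: str, teacher: Callable[[str], str]) -> str:
    """First-thought prefix: seed the agent trajectory with the first CoT step.

    `teacher` is a hypothetical callable mapping a prompt to generated text;
    both prompt templates below are illustrative, not taken from the paper.
    """
    cot_response = teacher(f"Question: {question}\nLet's think step by step.")
    return f"Task: {question}\nThought: {first_thought(cot_response)}"


def run_action(code: str) -> Optional[str]:
    """Run one candidate code action; return its outcome as text, or None on error.

    Assumption: each action stores its answer in a variable named `result`.
    An action that never sets `result` yields the text "None".
    """
    namespace: dict = {}
    try:
        exec(code, namespace)  # illustration only; exec is not a safe sandbox
    except Exception:
        return None
    return repr(namespace.get("result"))


def select_consistent_action(candidates: List[str]) -> Optional[str]:
    """Self-consistent action generation, sketched as a majority vote.

    Executes every candidate, drops those that raise, and returns the first
    candidate whose outcome is the most common one. Returns None if every
    candidate fails.
    """
    executed = [(code, run_action(code)) for code in candidates]
    valid = [(code, outcome) for code, outcome in executed if outcome is not None]
    if not valid:
        return None
    top_outcome, _ = Counter(outcome for _, outcome in valid).most_common(1)[0]
    return next(code for code, outcome in valid if outcome == top_outcome)
```

For example, given the candidates `"result = 6 * 7"`, `"result = 42"`, `"result = 6 + 7"`, and `"result = 1 / 0"`, the last one is discarded because it raises, and the first is returned because two candidates agree on the outcome 42. In a real agent the candidates would be sampled from the small model, and code would run in an isolated interpreter rather than through `exec`.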