AgentDistill: Training-Free Agent Distillation with Generalizable MCP Boxes

Jiahao Qiu, Xinzhe Juan, Yimin Wang, Ling Yang, Xuan Qi, Tongcheng Zhang, Jiacheng Guo, Yifu Lu, Zixin Yao, Hongru Wang, Shilong Liu, Xun Jiang, Liu Leqi, Mengdi Wang

TL;DR

AgentDistill introduces a training-free agent distillation pipeline that transfers task-solving capabilities from large teacher agents to small student agents through distilled Model–Context–Protocols (MCPs). By extracting MCPs from successful trajectories and consolidating them into a reusable MCP-Box, students can perform tool-based reasoning at inference without fine-tuning or trajectory replay. Across biomedical and mathematical benchmarks, MCP-equipped students approach or match teacher performance and outperform retrieval-based baselines, demonstrating strong generalization with low overhead. The approach decouples task semantics from implementation, enabling scalable, domain-agnostic tool usage and efficient deployment of lightweight agents in novel environments.

Abstract

While knowledge distillation has become a mature field for compressing large language models (LLMs) into smaller ones by aligning their outputs or internal representations, the distillation of LLM-based agents, which involve planning, memory, and tool use, remains relatively underexplored. Existing agent distillation methods typically replay full teacher trajectories or imitate step-by-step teacher tool usage, but they often struggle to train student agents to dynamically plan and act in novel environments. We propose AgentDistill, a novel, training-free agent distillation framework that enables efficient and scalable knowledge transfer via direct reuse of Model-Context-Protocols (MCPs), which are structured and reusable task-solving modules autonomously generated by teacher agents. The reuse of these distilled MCPs enables student agents to generalize their capabilities across domains and solve new problems with minimal supervision or human intervention. Experiments on biomedical and mathematical benchmarks demonstrate that our distilled student agents, built on small language models, can achieve performance comparable to advanced systems built on LLMs, such as OctoTools (GPT-4o), highlighting the effectiveness of our framework in building scalable and cost-efficient intelligent agents.

Paper Structure

This paper contains 40 sections, 8 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Comparison between traditional LLM distillation (top) and our proposed training-free agent distillation framework (bottom). Traditional LLM distillation relies on chain-of-thought prompting followed by costly fine-tuning on rationale–label pairs, whereas our method eliminates training entirely. Instead, a teacher agent autonomously generates modular and reusable Model–Context–Protocols (MCPs), which are directly integrated into student agents. This enables sLM-based agents to inherit task-solving capabilities without gradient updates or trajectory replay.
  • Figure 2: Performance comparison across three benchmarks. After AgentDistill, student agents with a small language model backbone achieve performance comparable to agents using pre-defined tools (e.g., OctoTools with GPT-4o), demonstrating the effectiveness of our distillation framework.
  • Figure 3: Overview of AgentDistill, the training-free agent distillation framework via Model–Context–Protocols (MCPs). The teacher agent, backed by a large language model, solves tasks by decomposing them through a Manager Agent and generating task-specific MCPs via open-source search, script generation, and virtual execution. Valid MCPs are abstracted, clustered, and consolidated into a reusable MCP-Box. At inference, the student agent, backed by a small language model, leverages this MCP-Box to perform tool-based reasoning without any fine-tuning or trajectory replay. This enables lightweight agents to inherit task-solving capabilities from stronger models efficiently.
  • Figure 4: Illustrative example of the MCP-Box construction process. Starting from two raw MCP drafts (green and blue) targeting distinct subtasks, we apply (1) abstraction to rewrite them into parameterized and reusable forms, (2) clustering to group functionally similar MCPs, and (3) consolidation to merge them into a single, general-purpose MCP (yellow) with configurable parameters. The resulting tool integrates multiple behaviors and is compatible with FastMCP execution.
  • Figure 5: AgentDistill constructs a generalizable MCP from teacher-generated subtasks. Green and blue MCPs target specific goals (e.g., bright spot detection, left-side analysis), which are consolidated into a reusable parameterized MCP (yellow). The distilled MCP enables flexible reuse by adjusting arguments like region and analysis_mode, making it adaptable to different tasks without retraining.
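
The consolidation step described in Figures 4 and 5 can be illustrated with a minimal sketch. It is not the paper's implementation: the function names `crop_region` and `analyze_image`, the `threshold` parameter, the mode strings, and the dictionary `MCP_BOX` are all hypothetical, and only the argument names `region` and `analysis_mode` come from the Figure 5 caption. The sketch uses plain Python lists in place of real images and does not register the tool with FastMCP. It shows how two task-specific tools (bright spot detection and left-side analysis) might collapse into one parameterized tool.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for an image: grayscale intensities in [0, 1], row-major.
Image = List[List[float]]


def crop_region(image: Image, region: str) -> Image:
    """Select the part of the image named by `region` ("full", "left", "right")."""
    if not image or not image[0]:
        raise ValueError("image must contain at least one pixel")
    half = len(image[0]) // 2
    if region == "full":
        return image
    if region == "left":
        return [row[:half] for row in image]
    if region == "right":
        return [row[half:] for row in image]
    raise ValueError(f"unknown region: {region!r}")


def analyze_image(
    image: Image,
    region: str = "full",
    analysis_mode: str = "mean_intensity",
    threshold: float = 0.8,  # assumed cutoff for what counts as "bright"
) -> Dict[str, float]:
    """One consolidated tool replacing two single-purpose drafts.

    The behavior is chosen by arguments instead of by separate functions,
    so a new subtask only needs a different argument combination.
    """
    pixels = [p for row in crop_region(image, region) for p in row]
    if not pixels:
        raise ValueError("selected region is empty")
    if analysis_mode == "mean_intensity":
        return {"mean_intensity": sum(pixels) / len(pixels)}
    if analysis_mode == "bright_spots":
        count = sum(1 for p in pixels if p >= threshold)
        return {"bright_pixel_count": float(count)}
    raise ValueError(f"unknown analysis_mode: {analysis_mode!r}")


# Hypothetical MCP-Box: a name-to-callable registry the student agent can query.
MCP_BOX: Dict[str, Callable[..., Dict[str, float]]] = {
    "analyze_image": analyze_image,
}

if __name__ == "__main__":
    demo = [[0.1, 0.9, 0.2, 0.95],
            [0.0, 0.85, 0.3, 0.1]]
    tool = MCP_BOX["analyze_image"]
    print(tool(demo, region="left", analysis_mode="bright_spots"))
    print(tool(demo, region="full", analysis_mode="mean_intensity"))
```

In this sketch, the original "bright spot detection" draft corresponds to `analysis_mode="bright_spots"` and the "left-side analysis" draft to `region="left"`. A student agent would only need to pick the tool and its arguments, which is the kind of reuse the captions describe.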