
From Multi-Agent to Single-Agent: When Is Skill Distillation Beneficial?

Binyan Xu, Dong Fang, Haitao Li, Kehuan Zhang

Abstract

Multi-agent systems (MAS) tackle complex tasks by distributing expertise, though this often comes at the cost of heavy coordination overhead, context fragmentation, and brittle phase ordering. Distilling a MAS into a single-agent skill can bypass these costs, but there is no principled answer to when and what to distill. In practice, the empirical outcome is strikingly inconsistent: skill lift ranges from a 28% improvement to a 2% degradation across metrics of the exact same task. In this work, we reveal that skill utility is governed not by the task but by the evaluation metric. We introduce Metric Freedom ($F$), the first a priori predictor of skill utility. $F$ measures the topological rigidity of a metric's scoring landscape by quantifying how output diversity couples with score variance via a Mantel test. Guided by $F$, we propose a two-stage adaptive distillation framework. Stage 1 acts as a selective extraction mechanism, extracting tools and knowledge while discarding restrictive structures on "free" metrics to preserve exploration. Stage 2 targets computationally intensive iterative refinement exclusively at "rigid" metrics ($F \lesssim 0.6$) to eliminate trajectory-local overfitting. Evaluated across 4 tasks, 11 datasets, and 6 metrics, $F$ strongly predicts skill utility ($\rho = -0.62$, $p < 0.05$). Strikingly, identical agent trajectories yield diametrically opposite skill lifts under rigid versus free metrics, demonstrating that skill utility is fundamentally a metric-level property. Driven by this signal, our adaptive agent matches or exceeds the original MAS while reducing cost by up to 8$\times$ and latency by up to 15$\times$.
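To make the definition concrete: Figure 1 expresses Metric Freedom as $F = 1 - \rho(D^{\text{out}}, D^{\text{score}})$, the complement of the Mantel correlation between pairwise output distances and pairwise score distances. The sketch below is a minimal illustration of that computation, not the paper's implementation; the choice of output-distance function is left to the caller, and the Pearson correlation, permutation count, and function names here are all assumptions.

```python
import numpy as np

def mantel(d_a, d_b, n_perm=999, seed=0):
    """Mantel test: Pearson correlation between the upper triangles of two
    symmetric distance matrices, with a permutation-based p-value."""
    iu = np.triu_indices_from(d_a, k=1)                # unique item pairs
    rho = np.corrcoef(d_a[iu], d_b[iu])[0, 1]
    rng = np.random.default_rng(seed)
    n = d_a.shape[0]
    null = np.empty(n_perm)
    for i in range(n_perm):
        p = rng.permutation(n)                         # relabel items in d_a
        null[i] = np.corrcoef(d_a[p][:, p][iu], d_b[iu])[0, 1]
    p_val = (1 + np.sum(np.abs(null) >= abs(rho))) / (1 + n_perm)
    return rho, p_val

def metric_freedom(d_out, scores, **mantel_kw):
    """F = 1 - rho(D_out, D_score). F near 0: output diversity tracks score
    variance (rigid metric); F near 1: diversity barely moves scores (free)."""
    scores = np.asarray(scores, dtype=float)
    d_score = np.abs(scores[:, None] - scores[None, :])  # pairwise score gaps
    rho, p_val = mantel(d_out, d_score, **mantel_kw)
    return 1.0 - rho, p_val
```

On this reading, a rigid metric couples any change in the output to a change in the score ($\rho$ near 1, so $F \approx 0$, as reported for CE-MSA), while a free metric leaves the score largely insensitive to output diversity ($F$ approaching 1, as reported for FE), matching the endpoints of the freedom spectrum in Figure 3.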



Figures (7)

  • Figure 1: System overview. Our two-stage framework converts multi-agent systems into efficient single-agent skills guided by Metric Freedom $F = 1 - \rho(D^{\text{out}}, D^{\text{score}})$. Stage 1 selectively extracts tools and knowledge; task-decomposition structure is retained or discarded based on $F$. Stage 2 optionally iterates the skill via a four-agent architecture (Explore, Main, Analyzer, Runner), recommended when $F \lesssim 0.6$; a minimal sketch of this gating rule follows the figure list.
  • Figure 2: Task-level overview across all four tasks, averaged over the datasets within each task. (a) Performance: the adaptive skill is the best or near-best single-agent method on every task; it outperforms the Original MAS on CE and CD and matches it on FE. (b) Latency and (c) Cost: single-agent methods are consistently cheaper and faster, with the gap largest on CD (8$\times$) and FE (15$\times$). Per-dataset breakdowns appear in Appendix \ref{app:full-bar}.
  • Figure 3: Freedom spectrum. Large filled circles = metric-level aggregates ($n{=}6$); small open circles = individual dataset-level points ($n{=}13$). Dashed lines connect datasets to their metric mean. The negative trend ($\rho{=}-0.62$ at the data-point level; $\rho{=}-0.89$ at the metric level) confirms that rigid metrics benefit most from skill augmentation. The strongest evidence comes from comparing CE-MSA ($F{\approx}0$, upper-left) and CE-MRE ($F{\approx}0.7$--$0.9$, lower-right), computed from the same outputs.
  • Figure 4: Ablation on Text-to-SQL (BIRD-147) and Causal Estimation (average MSA). Both tools and knowledge contribute; their combination (Full) achieves the best result. "Pipe." = Structured Pipeline skill, which underperforms the full adaptive skill.
  • Figure 5: Skill iterator trajectories across all four tasks. Val score (solid) and Train score (dashed) per iteration. Green dashed line marks the selected best version. (a) CE-MSA ($F{\approx}0$): rapid gain to 100% at v2, then oscillation. (b) CD ($F{=}0.24$--$0.77$): v2 achieves best val F1 (90.9%); v1 timed out. (c) T2SQL ($F{=}0.50$): steady improvement, val EX plateaus at 86.7% from v3. (d) FE ($F{\approx}0.59$--$0.97$): marginal gains, val AUC plateaus near 65.6% after v2, consistent with high freedom.
  • ...and 2 more figures
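The gating rule from Figure 1 reduces to a one-line decision on $F$. The hypothetical driver below, reusing `metric_freedom` from the earlier sketch, shows where the $F \lesssim 0.6$ threshold enters; `extract_skill` and `iterate_skill` are illustrative placeholders, not the authors' API.

```python
FREEDOM_THRESHOLD = 0.6  # Figure 1: Stage 2 recommended when F <~ 0.6

def adaptive_distill(trajectories, d_out, scores, extract_skill, iterate_skill):
    """Two-stage adaptive distillation gated by Metric Freedom F.

    Stage 1 always runs: extract tools and knowledge from MAS trajectories,
    keeping the task-decomposition structure only when the metric is rigid.
    Stage 2 (costly iterative refinement) runs only on rigid metrics.
    """
    f, _ = metric_freedom(d_out, scores)
    rigid = f <= FREEDOM_THRESHOLD
    skill = extract_skill(trajectories, keep_structure=rigid)  # Stage 1
    if rigid:                                                  # Stage 2
        skill = iterate_skill(skill)
    return skill, f
```

The design choice this sketch mirrors is that Stage 1 is unconditional while the expensive refinement loop is spent only where the metric's scoring landscape is rigid enough to reward it.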