AgentArk: Distilling Multi-Agent Intelligence into a Single LLM Agent
Yinyi Luo, Yiqiao Jin, Weichen Yu, Mengqi Zhang, Srijan Kumar, Xiaoxiao Li, Weijie Xu, Xin Chen, Jindong Wang
TL;DR
AgentArk tackles the inefficiency and vulnerability of multi-agent reasoning by distilling interaction-induced reasoning dynamics into a single large language model. It introduces three hierarchical distillation strategies: Reasoning-Enhanced SFT (RSFT), Distillation with Data Augmentation (DA), and Process-Aware Distillation (PAD). All three build on a data pipeline of multi-agent debate, correctness-driven trajectory extraction, and process-level supervision. PAD, which uses a Process Reward Model (PRM) with a contrastive loss and Group Relative Policy Optimization (GRPO), consistently yields the strongest improvements in reasoning structure, self-checking, and generalization, including transfer to multimodal LLMs. Across diverse backbones and benchmarks such as GSM8K, MATH, MedMCQA, and QMSum, AgentArk improves single-agent performance, robustness, and cross-domain reasoning while reducing inference-time cost relative to multi-agent systems (MAS). The findings support internalizing MAS reasoning signals as a route to efficient, robust, and scalable reasoning in resource-constrained deployments.
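To make the "group relative" part of Group Relative Policy Optimization concrete, the sketch below normalizes the rewards of several responses sampled for the same prompt against the group's own mean and standard deviation. It is an illustration of the general GRPO idea only, not the AgentArk implementation: the function name `group_relative_advantages`, the `eps` constant, the use of the population standard deviation, and the example scores are all assumptions, and the PRM and contrastive loss used by PAD are not modeled here.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses to the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so a
    response is scored relative to its siblings rather than on an
    absolute scale. Hypothetical helper, not taken from the paper's code.
    """
    if not rewards:
        return []
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group (an assumption)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical reward scores for four sampled reasoning traces.
scores = [0.9, 0.4, 0.4, 0.1]
advantages = group_relative_advantages(scores)
```

Traces that score above the group mean receive a positive advantage and those below receive a negative one, so no separately learned value baseline is needed for this step.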
Abstract
While large language model (LLM) multi-agent systems achieve superior reasoning performance through iterative debate, their practical deployment is limited by high computational cost and error propagation. This paper proposes AgentArk, a novel framework that distills multi-agent dynamics into the weights of a single model, effectively transforming explicit test-time interactions into implicit model capabilities. This equips a single agent with the intelligence of multi-agent systems while remaining computationally efficient. Specifically, we investigate three hierarchical distillation strategies across various models, tasks, scales, and scenarios: reasoning-enhanced fine-tuning, trajectory-based augmentation, and process-aware distillation. By shifting the burden of computation from inference to training, the distilled models preserve the efficiency of one agent while exhibiting the strong reasoning and self-correction performance of multiple agents. They further demonstrate enhanced robustness and generalization across diverse reasoning tasks. We hope this work can shed light on future research on efficient and robust multi-agent development. Our code is at https://github.com/AIFrontierLab/AgentArk.
