AgentArk: Distilling Multi-Agent Intelligence into a Single LLM Agent

Yinyi Luo; Yiqiao Jin; Weichen Yu; Mengqi Zhang; Srijan Kumar; Xiaoxiao Li; Weijie Xu; Xin Chen; Jindong Wang

TL;DR

AgentArk tackles the inefficiency and vulnerability of multi-agent reasoning by distilling interaction-induced reasoning dynamics into a single large language model. It introduces three hierarchical distillation strategies—Reasoning-Enhanced SFT (RSFT), Distillation with Data Augmentation (DA), and Process-Aware Distillation (PAD)—built on a data pipeline of multi-agent debate, correctness-driven trajectory extraction, and process-level supervision. PAD, which uses a Process Reward Model with a contrastive loss and Group Relative Policy Optimization, consistently yields the strongest improvements in reasoning structure, self-checking, and generalization, including transfer to multimodal LLMs. Across diverse backbones and benchmarks like GSM8K, MATH, MedMCQA, and QMSum, AgentArk delivers notable gains in single-agent performance, robustness, and cross-domain reasoning while reducing inference-time costs relative to MAS. The findings support the viability of internalizing MAS reasoning signals to enable efficient, robust, and scalable reasoning in resource-constrained deployments.
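As a rough illustration of the group-relative step that gives Group Relative Policy Optimization its name, the sketch below normalizes the rewards of several responses sampled for the same prompt against the group's own mean and standard deviation. This is the commonly described form of the GRPO advantage, not code from the AgentArk repository. The function name, the `eps` constant, and the use of the population standard deviation are assumptions made for this example, and the rewards stand in for whatever scores a Process Reward Model would assign.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group.

    `rewards` holds one scalar score per sampled response to the same prompt
    (here assumed to come from a reward model). `eps` is an assumed small
    constant that avoids division by zero when all rewards are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical scores for four sampled responses to one question.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Responses scoring above the group mean receive positive advantages and those below receive negative ones, so no separate value network is needed to form a baseline.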

Abstract

While large language model (LLM) multi-agent systems achieve superior reasoning performance through iterative debate, their practical deployment is limited by high computational cost and error propagation. This paper proposes AgentArk, a novel framework that distills multi-agent dynamics into the weights of a single model, effectively transforming explicit test-time interactions into implicit model capabilities. This equips a single agent with the intelligence of multi-agent systems while remaining computationally efficient. Specifically, we investigate three hierarchical distillation strategies across various models, tasks, scaling, and scenarios: reasoning-enhanced fine-tuning; trajectory-based augmentation; and process-aware distillation. By shifting the burden of computation from inference to training, the distilled models preserve the efficiency of one agent while exhibiting the strong reasoning and self-correction performance of multiple agents. They further demonstrate enhanced robustness and generalization across diverse reasoning tasks. We hope this work sheds light on future research on efficient and robust multi-agent development. Our code is at https://github.com/AIFrontierLab/AgentArk.

Paper Structure (45 sections, 13 equations, 10 figures, 13 tables)

Introduction
Related Work
Method
Overview
Data Generation and Knowledge Extraction
Distillation Methods
Reasoning-Enhanced SFT
Distillation with Data Augmentation
Process-Aware Distillation
Experiments
Experimental Setup
Main Results
Scaling and Data Dynamics
Analysis of the Reasoning Quality
Robustness and Generalization
...and 30 more sections

Figures (10)

Figure 1: AgentArk distills the reasoning capability of multi-agent systems into one single agent, such that this single unit can imitate the thinking process with boosted performance.
Figure 2: Overview of AgentArk. The pipeline proceeds through three stages: (1) Data Generation Through Multi-Agent Debate to produce diverse reasoning trajectories; (2) Knowledge Extraction to filter for high-quality corrective traces; and (3) Distillation utilizing Standard SFT, Reasoning-enhanced SFT, Distillation with Data Augmentation, and Process-Aware Distillation (PRM + GRPO). The resulting student model achieves optimized, low-latency reasoning that generalizes across diverse task domains.
Figure 3: Distillation from Qwen3-32B to different student models.
Figure 4: Effect of agent scale ($5,10,20$) on distillation performance evaluated on GSM8K and MedMCQA.
Figure 5: Data scaling behavior of distillation from Qwen3-32B to Qwen3-0.6B, showing target model performance as a function of training data size across datasets.
...and 5 more figures
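Stage (2) in the Figure 2 caption, filtering debate output for high-quality corrective traces, can be pictured with a minimal sketch. Everything here is hypothetical and chosen only for illustration: the `AgentTrace` record, the `extract_traces` helper, and the exact-match correctness check are assumptions, not the paper's actual data format or extraction rule. The sketch keeps traces whose final answer matches the reference and marks those in which the agent was wrong in an earlier round and later corrected itself.

```python
from dataclasses import dataclass


@dataclass
class AgentTrace:
    """Hypothetical record of one agent's answers across debate rounds."""
    agent_id: int
    answers: list[str]    # the agent's answer after each round
    reasoning: list[str]  # the agent's reasoning text after each round


def extract_traces(traces, gold):
    """Keep traces that end on the reference answer (assumed exact match).

    Returns (trace, is_corrective) pairs, corrective traces first. A trace is
    treated as corrective if any earlier round's answer differed from `gold`.
    """
    gold = gold.strip()
    kept = []
    for t in traces:
        if not t.answers or t.answers[-1].strip() != gold:
            continue  # drop traces that end on a wrong answer
        corrective = any(a.strip() != gold for a in t.answers[:-1])
        kept.append((t, corrective))
    kept.sort(key=lambda pair: not pair[1])  # stable sort, corrective first
    return kept


# Made-up debate with three agents and two rounds.
debate = [
    AgentTrace(0, ["12", "18"], ["first attempt", "revised after critique"]),
    AgentTrace(1, ["18", "18"], ["direct solution", "confirmed"]),
    AgentTrace(2, ["18", "12"], ["direct solution", "swayed by a wrong peer"]),
]
selected = extract_traces(debate, gold="18")
```

In this made-up debate, agent 0 is kept and marked corrective, agent 1 is kept as a consistently correct trace, and agent 2 is dropped because it ends on a wrong answer.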
