Table of Contents
Fetching ...

Adaptive Collaboration with Humans: Metacognitive Policy Optimization for Multi-Agent LLMs with Continual Learning

Wei Yang, Defu Cao, Jiacheng Pang, Muyan Weng, Yan Liu

TL;DR

Experiments on challenging mathematical and problem-solving benchmarks show that HILA, equipped with Dual-Loop Policy Optimization, consistently outperforms advanced MAS, establishing a principled foundation for collaborative and continually improving agentic systems.

Abstract

While scaling individual Large Language Models (LLMs) has delivered remarkable progress, the next frontier lies in scaling collaboration through multi-agent systems (MAS). However, purely autonomous MAS remain ''closed-world'' systems, constrained by the static knowledge horizon of pre-trained models. This limitation makes them brittle on tasks requiring knowledge beyond training data, often leading to collective failure under novel challenges. To address this, we propose the Human-In-the-Loop Multi-Agent Collaboration (HILA) framework, a principled paradigm for human--agent collaboration. HILA trains agents to learn a metacognitive policy that governs when to solve problems autonomously and when to defer to a human expert. To operationalize this policy, we introduce Dual-Loop Policy Optimization, which disentangles immediate decision-making from long-term capability growth. The inner loop applies Group Relative Policy Optimization (GRPO) with a cost-aware reward to optimize deferral decisions, while the outer loop implements continual learning, transforming expert feedback into high-quality supervised signals that strengthen the agent's reasoning ability. Experiments on challenging mathematical and problem-solving benchmarks show that HILA, equipped with Dual-Loop Policy Optimization, consistently outperforms advanced MAS, establishing a principled foundation for collaborative and continually improving agentic systems.

Adaptive Collaboration with Humans: Metacognitive Policy Optimization for Multi-Agent LLMs with Continual Learning

TL;DR

Experiments on challenging mathematical and problem-solving benchmarks show that HILA, equipped with Dual-Loop Policy Optimization, consistently outperforms advanced MAS, establishing a principled foundation for collaborative and continually improving agentic systems.

Abstract

While scaling individual Large Language Models (LLMs) has delivered remarkable progress, the next frontier lies in scaling collaboration through multi-agent systems (MAS). However, purely autonomous MAS remain ''closed-world'' systems, constrained by the static knowledge horizon of pre-trained models. This limitation makes them brittle on tasks requiring knowledge beyond training data, often leading to collective failure under novel challenges. To address this, we propose the Human-In-the-Loop Multi-Agent Collaboration (HILA) framework, a principled paradigm for human--agent collaboration. HILA trains agents to learn a metacognitive policy that governs when to solve problems autonomously and when to defer to a human expert. To operationalize this policy, we introduce Dual-Loop Policy Optimization, which disentangles immediate decision-making from long-term capability growth. The inner loop applies Group Relative Policy Optimization (GRPO) with a cost-aware reward to optimize deferral decisions, while the outer loop implements continual learning, transforming expert feedback into high-quality supervised signals that strengthen the agent's reasoning ability. Experiments on challenging mathematical and problem-solving benchmarks show that HILA, equipped with Dual-Loop Policy Optimization, consistently outperforms advanced MAS, establishing a principled foundation for collaborative and continually improving agentic systems.
Paper Structure (60 sections, 10 equations, 14 figures, 13 tables)

This paper contains 60 sections, 10 equations, 14 figures, 13 tables.

Figures (14)

  • Figure 1: Overview of the proposed HILA framework and its Dual-Loop Policy Optimization (DLPO) training paradigm. Left: HILA coordinates multi-agent collaboration with both proactive human guidance and reactive expert feedback via metacognitive states and strategic actions (Eval, Create, Defer). Right: DLPO optimizes the meta-policy in an inner RL loop with cost-aware rewards, and expands the model’s knowledge boundary in an outer continual-learning loop by storing DEFER-triggered human feedback as offline supervision.
  • Figure 2: Effect of human proxy capability on HILA. Using stronger language models as the external expert consistently improves performance on GSM8K, AMC, and MMLU.
  • Figure 3: Accuracy as a function of the number of agents.
  • Figure 4: Accuracy as a function of the number of rounds.
  • Figure 6: Accuracy vs. Defer rate on GSM8K across training stages (Init $\rightarrow$ GRPO $\rightarrow$ DLPO). DLPO achieves a joint improvement, increasing accuracy while reducing reliance on external intervention.
  • ...and 9 more figures