Adaptive Collaboration with Humans: Metacognitive Policy Optimization for Multi-Agent LLMs with Continual Learning

Wei Yang; Defu Cao; Jiacheng Pang; Muyan Weng; Yan Liu

TL;DR

Experiments on challenging mathematical and problem-solving benchmarks show that HILA, equipped with Dual-Loop Policy Optimization, consistently outperforms advanced MAS, establishing a principled foundation for collaborative and continually improving agentic systems.

Abstract

While scaling individual Large Language Models (LLMs) has delivered remarkable progress, the next frontier lies in scaling collaboration through multi-agent systems (MAS). However, purely autonomous MAS remain "closed-world" systems, constrained by the static knowledge horizon of pre-trained models. This limitation makes them brittle on tasks requiring knowledge beyond training data, often leading to collective failure under novel challenges. To address this, we propose the Human-In-the-Loop Multi-Agent Collaboration (HILA) framework, a principled paradigm for human-agent collaboration. HILA trains agents to learn a metacognitive policy that governs when to solve problems autonomously and when to defer to a human expert. To operationalize this policy, we introduce Dual-Loop Policy Optimization, which disentangles immediate decision-making from long-term capability growth. The inner loop applies Group Relative Policy Optimization (GRPO) with a cost-aware reward to optimize deferral decisions, while the outer loop implements continual learning, transforming expert feedback into high-quality supervised signals that strengthen the agent's reasoning ability. Experiments on challenging mathematical and problem-solving benchmarks show that HILA, equipped with Dual-Loop Policy Optimization, consistently outperforms advanced MAS, establishing a principled foundation for collaborative and continually improving agentic systems.
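To make the inner-loop idea concrete, here is a minimal sketch of how a cost-aware reward might feed a group-relative advantage of the kind GRPO uses. The reward shape (1 for a correct answer, minus a fixed penalty for deferring), the `defer_cost` value, and all function names are illustrative assumptions; the paper's actual reward formulation is not reproduced in this summary.

```python
from statistics import mean, pstdev

# Action names follow the paper's action space (Evaluate, Create, Defer).
ACTIONS = ("eval", "create", "defer")


def cost_aware_reward(correct: bool, action: str, defer_cost: float = 0.3) -> float:
    """Hypothetical reward: 1 if the final answer is correct, else 0,
    minus an assumed fixed cost whenever the agent defers to the expert."""
    if action not in ACTIONS:
        raise ValueError(f"unknown action: {action}")
    reward = 1.0 if correct else 0.0
    if action == "defer":
        reward -= defer_cost  # assumed penalty for consuming expert effort
    return reward


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Standardize rewards within one sampled group: subtract the group
    mean and divide by the group standard deviation."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One group of sampled rollouts for the same problem: (action, was_correct).
rollouts = [("eval", True), ("create", False), ("defer", True), ("eval", False)]
rewards = [cost_aware_reward(correct, action) for action, correct in rollouts]
advantages = group_relative_advantages(rewards)

for (action, correct), r, a in zip(rollouts, rewards, advantages):
    print(f"{action:6s} correct={correct!s:5s} reward={r:.2f} advantage={a:+.3f}")
```

Under these assumptions, a correct autonomous answer receives a higher advantage than a correct deferred one, which is one simple way a policy could be pushed to defer only when autonomy is likely to fail.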

Paper Structure (60 sections, 10 equations, 14 figures, 13 tables)

Introduction
Related Work
Methodology
Preliminaries: The Metacognitive Markov Decision Process
A Framework for Human-Agent Collaboration
Structured Cognitive State Space
The Strategic Action Space
Evaluate ($a^{\text{eval}}$): Exploiting Collective Knowledge.
Create ($a^{\text{create}}$): Creative Exploration and Hypothesis Generation.
Defer ($a^{\text{defer}}$): Risk Mitigation and Knowledge Augmentation.
Collaborative Interaction Model
Adaptive Policy Optimization with Continual Learning
Inner Loop: Reinforcement Learning for Metacognitive Policy
Reward Formulation.
GRPO Objective.
...and 45 more sections

Figures (14)

Figure 1: Overview of the proposed HILA framework and its Dual-Loop Policy Optimization (DLPO) training paradigm. Left: HILA coordinates multi-agent collaboration with both proactive human guidance and reactive expert feedback via metacognitive states and strategic actions (Eval, Create, Defer). Right: DLPO optimizes the meta-policy in an inner RL loop with cost-aware rewards, and expands the model’s knowledge boundary in an outer continual-learning loop by storing DEFER-triggered human feedback as offline supervision.
Figure 2: Effect of human proxy capability on HILA. Using stronger language models as the external expert consistently improves performance on GSM8K, AMC, and MMLU.
Figure 3: Accuracy as a function of the number of agents.
Figure 4: Accuracy as a function of the number of rounds.
Figure 6: Accuracy vs. Defer rate on GSM8K across training stages (Init $\rightarrow$ GRPO $\rightarrow$ DLPO). DLPO achieves a joint improvement, increasing accuracy while reducing reliance on external intervention.
...and 9 more figures
