AgentAsk: Multi-Agent Systems Need to Ask

Bohan Lin, Kuo Yang, Yingchuan Lai, Yudong Zhang, Chen Zhang, Guibin Zhang, Xinlei Yu, Miao Yu, Xu Wang, Yang Wang

TL;DR

The paper tackles unreliability in LLM-based multi-agent systems (MAS) caused by edge-level error cascades at message handoffs. It proposes AgentAsk, a lightweight, architecture-agnostic clarification module that inserts minimal questions to arrest error propagation. AgentAsk is trained with a three-stage pipeline: distill edge-level judgments, supervise a lightweight clarifier, and optimize online with E-GRPO. The paper formalizes a four-type edge-level error taxonomy (Data Gap, Referential Drift, Signal Corruption, Capability Gap) and demonstrates improvements across math, reasoning, and coding benchmarks with overhead under 5%, offering a practical path toward more robust MAS orchestration. The edge-centric design and training methodology complement role-based governance and self-checking, enabling more reliable collaboration among LLM-driven agents.

Abstract

Multi-agent systems built on large language models (LLMs) promise enhanced problem-solving capabilities through collaborative division of labor. However, they frequently underperform single-agent baselines due to edge-level error cascades: minor inaccuracies at one message handoff propagate across the entire chain. We propose AgentAsk, a lightweight and plug-and-play clarification module that treats every inter-agent message as a potential failure point and inserts minimally necessary questions to arrest error propagation. AgentAsk follows a three-stage pipeline: (i) distilling edge-level judgments from curated failure traces into a compact policy, (ii) supervising the policy to determine when/what/whom/how to ask, and (iii) optimizing online with E-GRPO, a reinforcement learning objective that balances accuracy, latency, and cost. The module is architecture-agnostic and easy to integrate into existing orchestration. Across math, reasoning, and coding benchmarks, AgentAsk consistently improves accuracy and robustness over public multi-agent implementations while keeping overhead minimal, with added latency and extra cost both below 5%, and it approaches the performance of a strong evaluator. Beyond empirical improvements, we contribute a principled taxonomy of edge-level errors and a practical recipe for link-local intervention, offering a scalable pathway toward more reliable LLM-based multi-agent systems.
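To make the idea of link-local intervention concrete, the following is a minimal sketch of a clarification hook placed on a single edge between two agents. It is an illustration under stated assumptions, not the authors' implementation: the names `Clarification`, `relay`, and `toy_clarifier`, the `"sender"`/`"receiver"` addressee labels, and the `max_asks` budget are all hypothetical, and the rule-based clarifier stands in for the learned policy described in the abstract.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# The four edge-level error types named in the taxonomy.
ERROR_TYPES = ("data_gap", "referential_drift", "signal_corruption", "capability_gap")


@dataclass
class Clarification:
    error_type: str  # one of ERROR_TYPES
    addressee: str   # whom to ask: "sender" or "receiver" (hypothetical labels)
    question: str    # what to ask


# Hypothetical interfaces: an agent maps a prompt to a reply, and a
# clarifier maps a message to an optional clarification request.
Agent = Callable[[str], str]
Clarifier = Callable[[str], Optional[Clarification]]


def relay(message: str, sender: Agent, receiver: Agent,
          clarifier: Clarifier, max_asks: int = 1) -> str:
    """Pass `message` across one edge, asking at most `max_asks` questions."""
    for _ in range(max_asks):
        ask = clarifier(message)
        if ask is None:  # "when": the clarifier sees no problem on this edge
            break
        target = sender if ask.addressee == "sender" else receiver
        answer = target(ask.question)
        # Attach the answer so the receiver sees the repaired message.
        message = f"{message}\n[clarification] {ask.question} -> {answer}"
    return receiver(message)


def toy_clarifier(message: str) -> Optional[Clarification]:
    """Rule-based stand-in for the learned policy: flags an unfilled value."""
    if "TBD" in message and "[clarification]" not in message:
        return Clarification("data_gap", "sender", "What value replaces TBD?")
    return None


def toy_sender(prompt: str) -> str:
    return "42"


def toy_receiver(prompt: str) -> str:
    return f"received: {prompt}"


if __name__ == "__main__":
    print(relay("x = TBD", toy_sender, toy_receiver, toy_clarifier))
    print(relay("x = 1", toy_sender, toy_receiver, toy_clarifier))
```

The hook wraps only the handoff, so the agents themselves are unchanged. This is one way to read the claim that the module is architecture-agnostic. The `max_asks` budget is an assumed, crude stand-in for the latency and cost constraints that the paper handles through training.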

Paper Structure

This paper contains 32 sections, 13 equations, 6 figures, 9 tables.

Figures (6)

  • Figure 1: Multi-Agent System Reasoning: Success vs. Cascade Failure. The left shows the normal operation of a MAS, while the right shows upstream agents transmitting errors during the interaction, which triggers a cascade effect and causes the entire system to fail.
  • Figure 2: From taxonomy to design. (a) Empirical distribution of edge-level error types ($N$=824). (b) Three design inspirations distilled from our taxonomy: localize fixes at the edge; map type→intent & addressee; enforce minimality-by-design. (c) Our designed AgentAsk for edge-level intervention (see the Methodology section).
  • Figure 3: Overview of AgentAsk. The module operates at the edge level, treating each inter-agent message as a potential failure point. The figure illustrates both the architecture and training process: (i) knowledge distillation from failure traces using a large evaluator to construct an edge-level corpus, (ii) supervised fine-tuning of a lightweight clarifier that decides when/what/whom/how to ask, and (iii) reinforcement learning with E-GRPO for adaptive clarification under latency and cost constraints. This pipeline equips AgentAsk with edge-aware monitoring and minimal, targeted interventions while remaining architecture-agnostic and easy to integrate into diverse MASs.
  • Figure 4: (Left) Pareto frontier (x=latency, y=extra cost; bubble size=Acc), showing +AgentAsk nearing +GPT-5 at much lower overhead. (Middle) Error-Type distributions (DG/SC/RD/CG) across datasets. (Right) Sensitivity to window $H$ and penalty $\lambda_{\mathrm{sw}}$, highlighting a stable region near the default and the accuracy–efficiency trade-off.
  • Figure 5: A case illustrating our error taxonomy. The middle panel shows the fraction of each of the four error types.
  • ...and 1 more figure
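
Figure 4 (left) compares configurations on latency, extra cost, and accuracy. As a hedged illustration of how such a frontier can be computed, the sketch below keeps every configuration that no other configuration dominates, where lower latency, lower extra cost, and higher accuracy are better. The function name `pareto_front`, the configuration names, and all numbers are placeholders invented for this example. They are not results from the paper.

```python
from typing import List, Tuple

# (name, latency, extra_cost, accuracy); lower latency/cost and higher
# accuracy are better. All values used below are illustrative placeholders.
Point = Tuple[str, float, float, float]


def dominates(a: Point, b: Point) -> bool:
    """True if `a` is at least as good as `b` everywhere and better somewhere."""
    no_worse = a[1] <= b[1] and a[2] <= b[2] and a[3] >= b[3]
    better = a[1] < b[1] or a[2] < b[2] or a[3] > b[3]
    return no_worse and better


def pareto_front(points: List[Point]) -> List[Point]:
    """Return the points that no other point dominates, in input order."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


if __name__ == "__main__":
    configs = [
        ("config_a", 1.0, 1.0, 0.80),
        ("config_b", 2.0, 2.0, 0.90),
        ("config_c", 2.5, 2.5, 0.85),  # dominated by config_b
        ("config_d", 1.0, 1.5, 0.80),  # dominated by config_a
    ]
    for name, *_ in pareto_front(configs):
        print(name)
```

The quadratic pairwise check is adequate for the handful of configurations a figure like this compares.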