MedAgentAudit: Diagnosing and Quantifying Collaborative Failure Modes in Medical Multi-Agent Systems

Lei Gu, Yinghao Zhu, Haoran Sang, Zixiang Wang, Dehao Sui, Wen Tang, Ewen Harrison, Junyi Gao, Lequan Yu, Liantao Ma

TL;DR

This work moves the evaluation of medical AI beyond final-answer accuracy by auditing the internal collaborative reasoning of multi-agent systems (MAS). Through a large-scale study of 3,600 cases across six medical datasets and six MAS frameworks, the authors develop a taxonomy of collaborative failure modes and a quantitative auditing framework. They identify four dominant patterns: flawed consensus driven by shared model deficiencies, suppression of correct minority opinions, ineffective discussion dynamics, and critical information loss during synthesis. Together these reveal a disconnect between accuracy and trustworthy clinical reasoning. The findings emphasize the need for transparent, auditable deliberation processes and provide methodological tools for diagnosing collaborative failures, supporting safer deployment of medical AI in clinical and public contexts.

Abstract

While large language model (LLM)-based multi-agent systems show promise in simulating medical consultations, their evaluation is often confined to final-answer accuracy. This practice treats their internal collaborative processes as opaque "black boxes" and overlooks a critical question: is a diagnostic conclusion reached through a sound and verifiable reasoning pathway? The inscrutable nature of these systems poses a significant risk in high-stakes medical applications, potentially leading to flawed or untrustworthy conclusions. To address this, we conduct a large-scale empirical study of 3,600 cases from six medical datasets and six representative multi-agent frameworks. Through a rigorous, mixed-methods approach combining qualitative analysis with quantitative auditing, we develop a comprehensive taxonomy of collaborative failure modes. Our quantitative audit reveals four dominant failure patterns: flawed consensus driven by shared model deficiencies, suppression of correct minority opinions, ineffective discussion dynamics, and critical information loss during synthesis. This study demonstrates that high accuracy alone is an insufficient measure of clinical or public trust. It highlights the urgent need for transparent and auditable reasoning processes, a cornerstone for the responsible development and deployment of medical AI.

Paper Structure

This paper contains 51 sections, 10 figures, 7 tables.

Figures (10)

  • Figure 1: Overview of the four-phase research methodology, encompassing (A) data generation, (B) qualitative analysis and taxonomy development, (C) quantitative auditing design, and (D) quantitative auditing and reporting.
  • Figure 2: An overview of the AuditTrail framework, comprising four mechanisms designed to quantify and record the multi-agent collaborative process.
  • Figure 3: A comprehensive taxonomy of collaborative failure modes in medical multi-agent systems. The taxonomy is structured chronologically across four phases of a collaborative task. Phase 1 (Task Comprehension) identifies initial errors from gaps in the base model's capabilities and incorrect problem scoping. Phase 2 (Collaboration Process) details dysfunctions during agent interaction, such as convergence towards a flawed consensus and ineffective dissent handling. Phase 3 (Final Decision-Making) addresses breakdowns in viewpoint aggregation and information loss. Phase 4 (Framework Design) covers overarching issues in architectural design and evaluation benchmarks.
  • Figure 4: A comprehensive taxonomy of collaborative success modes in medical multi-agent systems. The taxonomy is structured thematically into three core mechanisms of success. S1 (Consensus via Evidentiary Corroboration) encompasses scenarios where agents achieve a correct outcome by reinforcing and validating initial correct judgments through multi-perspective evidence. S2 (Error Correction via Interaction) details the processes where the collaboration actively corrects an initial error, such as when an expert agent's opinion overrides an incorrect majority or when clarifying definitions resolves a misunderstanding. S3 (Agent-Driven Self-Correction) captures instances of intra-agent reflection, where an agent proactively identifies and rectifies its own flawed arguments during the collaborative process.
  • Figure 5: Distribution of identified failure modes across the four chronological phases of the collaborative process.
  • ...and 5 more figures
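
As a rough illustration of how the four-phase taxonomy of Figure 3 could feed the per-phase distribution reported in Figure 5, the sketch below tallies annotated failures by phase. It is a minimal sketch and not the authors' implementation: the names `Phase`, `FailureAnnotation`, and `phase_distribution`, the case identifiers, and the example annotations are all hypothetical, and only the phase names and failure labels are taken from the Figure 3 caption.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Phase(Enum):
    """The four chronological phases named in the Figure 3 caption."""
    TASK_COMPREHENSION = 1
    COLLABORATION_PROCESS = 2
    FINAL_DECISION_MAKING = 3
    FRAMEWORK_DESIGN = 4


@dataclass(frozen=True)
class FailureAnnotation:
    """One annotated failure in one audited case (hypothetical schema)."""
    case_id: str
    phase: Phase
    label: str


def phase_distribution(annotations):
    """Return the fraction of annotated failures that fall in each phase."""
    counts = Counter(a.phase for a in annotations)
    total = sum(counts.values())
    # Phases with no annotations are reported as 0.0 rather than omitted.
    return {p: (counts[p] / total if total else 0.0) for p in Phase}


# Made-up annotations for illustration only; these are not results from the paper.
example = [
    FailureAnnotation("case-001", Phase.TASK_COMPREHENSION, "incorrect problem scoping"),
    FailureAnnotation("case-002", Phase.COLLABORATION_PROCESS, "flawed consensus"),
    FailureAnnotation("case-003", Phase.COLLABORATION_PROCESS, "ineffective dissent handling"),
    FailureAnnotation("case-004", Phase.FINAL_DECISION_MAKING, "information loss"),
]

distribution = phase_distribution(example)
for phase, share in distribution.items():
    print(f"{phase.name}: {share:.2f}")
```

Keeping the phase as an enum rather than a free-text field means every annotation must map to one of the four phases, which makes the tally exhaustive by construction.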