Silence is Not Consensus: Disrupting Agreement Bias in Multi-Agent LLMs via Catfish Agent for Clinical Decision Making
Yihan Wang, Qiao Yan, Zhenghao Xing, Lihao Liu, Junjun He, Chi-Wing Fu, Xiaowei Hu, Pheng-Ann Heng
TL;DR
The paper identifies Silent Agreement as a key bottleneck in medical multi-agent LLM frameworks and introduces the Catfish Agent, a role-based dissent mechanism designed to inject structured challenges. It implements two core interventions, complexity-aware engagement and tone-calibrated dissent, which adapt to case difficulty and consensus strength, respectively. In experiments on nine medical Q&A and three medical VQA benchmarks, the Catfish framework achieves substantial gains over single- and multi-agent baselines, including GPT-4o and DeepSeek-R1, with notable reductions in premature consensus and improved diagnostic reasoning. The work demonstrates the practical impact of deliberate disagreement in high-stakes medical decision making and outlines future directions for efficient coordination in multi-agent reasoning systems.
Abstract
Large language models (LLMs) have demonstrated strong potential in clinical question answering, with recent multi-agent frameworks further improving diagnostic accuracy via collaborative reasoning. However, we identify a recurring issue of Silent Agreement, where agents prematurely converge on diagnoses without sufficient critical analysis, particularly in complex or ambiguous cases. We present a new concept called the Catfish Agent, a role-specialized LLM designed to inject structured dissent and counter silent agreement. Inspired by the "catfish effect" in organizational psychology, the Catfish Agent challenges emerging consensus to stimulate deeper reasoning. We formulate two mechanisms to encourage effective and context-aware interventions: (i) a complexity-aware intervention that modulates agent engagement based on case difficulty, and (ii) a tone-calibrated intervention designed to balance critique and collaboration. Evaluations on nine medical Q&A and three medical VQA benchmarks show that our approach consistently outperforms both single- and multi-agent LLM frameworks, including leading commercial models such as GPT-4o and DeepSeek-R1.
