Talk Isn't Always Cheap: Understanding Failure Modes in Multi-Agent Debate
Andrea Wynn, Harsh Satija, Gillian Hadfield
TL;DR
The paper questions the assumption that multi-agent debate always improves AI reasoning, showing that debate among heterogeneous agents can degrade accuracy through social influence and sequential revision biases. Through a systematic empirical study on CSQA, MMLU, and GSM8K with diverse LLMs, it identifies failure modes such as sycophancy, conformity, and reasoning cascades, and it tests interventions such as correctness-payoff prompts. The findings show that debate can harm performance and that diversity does not guarantee gains, which points to the need for principled design choices that encourage critical evaluation, weighting by expertise, and independent verification. The work provides empirical insights and code to guide the development of safer, more reliable multi-agent reasoning systems.
Abstract
While multi-agent debate has been proposed as a promising strategy for improving AI reasoning ability, we find that debate can sometimes be harmful rather than helpful. Prior work has primarily focused on debates within homogeneous groups of agents, whereas we explore how diversity in model capabilities influences the dynamics and outcomes of multi-agent interactions. Through a series of experiments, we demonstrate that debate can lead to a decrease in accuracy over time, even in settings where stronger (i.e., more capable) models outnumber their weaker counterparts. Our analysis reveals that models frequently shift from correct to incorrect answers in response to peer reasoning, favoring agreement over challenging flawed reasoning. We perform additional experiments investigating various potential contributing factors to these harmful shifts, including sycophancy, social conformity, and model and task type. These results highlight important failure modes in the exchange of reasons during multi-agent debate, suggesting that naive applications of debate may cause performance degradation when agents are neither incentivised nor adequately equipped to resist persuasive but incorrect reasoning.
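To make the "correct to incorrect" shift concrete, the following is a minimal sketch of how such transitions might be tallied between two debate rounds. It is an illustration under stated assumptions, not the authors' evaluation code: the function names `transition_counts` and `accuracy`, the list-of-answers data layout, and the toy data are all hypothetical.

```python
from collections import Counter


def transition_counts(before, after, gold):
    """Count how answers move between two debate rounds.

    before, after: one agent's answers to each question in two
    consecutive rounds (hypothetical layout, not from the paper).
    gold: the reference answers for the same questions.
    """
    if not (len(before) == len(after) == len(gold)):
        raise ValueError("all inputs must have the same length")
    counts = Counter()
    for b, a, g in zip(before, after, gold):
        start = "correct" if b == g else "incorrect"
        end = "correct" if a == g else "incorrect"
        counts[f"{start}->{end}"] += 1
    return counts


def accuracy(answers, gold):
    """Fraction of answers that match the reference answers."""
    return sum(a == g for a, g in zip(answers, gold)) / len(gold)


# Toy data, invented purely for illustration.
gold = ["A", "B", "C", "D"]
round0 = ["A", "B", "C", "A"]  # answers before seeing peer reasoning
round1 = ["A", "C", "D", "D"]  # answers after one round of debate

counts = transition_counts(round0, round1, gold)
print(dict(counts))
print(accuracy(round0, gold), accuracy(round1, gold))
```

In this toy case two answers flip from correct to incorrect and only one flips the other way, so accuracy falls from 0.75 to 0.5. Separating the two flip directions, rather than reporting only the net accuracy change, is what exposes the harmful shifts the abstract describes.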
