Talk Isn't Always Cheap: Understanding Failure Modes in Multi-Agent Debate

Andrea Wynn, Harsh Satija, Gillian Hadfield

TL;DR

The paper questions the assumption that multi-agent debate always improves AI reasoning, showing that heterogeneous agent settings can degrade accuracy due to social influence and sequential revision biases. Through a systematic empirical study on CSQA, MMLU, and GSM8K with diverse LLMs, it identifies failure modes such as sycophancy, conformity, and reasoning cascades, and it tests interventions like correctness-payoff prompts. The findings reveal that debate can harm performance and that diversity does not guarantee gains, highlighting the need for principled design choices that encourage critical evaluation, weighting by expertise, and independent verification. The work provides dataset-driven insights and code to guide future development of safer, more reliable multi-agent reasoning systems.

Abstract

While multi-agent debate has been proposed as a promising strategy for improving AI reasoning ability, we find that debate can sometimes be harmful rather than helpful. Prior work has primarily focused on debates within homogeneous groups of agents, whereas we explore how diversity in model capabilities influences the dynamics and outcomes of multi-agent interactions. Through a series of experiments, we demonstrate that debate can lead to a decrease in accuracy over time - even in settings where stronger (i.e., more capable) models outnumber their weaker counterparts. Our analysis reveals that models frequently shift from correct to incorrect answers in response to peer reasoning, favoring agreement over challenging flawed reasoning. We perform additional experiments investigating various potential contributing factors to these harmful shifts - including sycophancy, social conformity, and model and task type. These results highlight important failure modes in the exchange of reasons during multi-agent debate, suggesting that naive applications of debate may cause performance degradation when agents are neither incentivised nor adequately equipped to resist persuasive but incorrect reasoning.

Paper Structure

This paper contains 18 sections, 5 figures, 1 table.

Figures (5)

  • Figure 1: We find that group accuracy frequently degrades over the course of debate rather than improving. Diverse refers to the case (1x , 1x , 1x ).
  • Figure 2: Breakdown of how agents change answers for different agent settings; results are aggregated over all debate rounds. We observe that most agents with incorrect initial answers do not improve their overall performance (peach bars), and, of those that do change their answers, more change from a correct answer to an incorrect one (red region) than from an incorrect to a correct one (green region).
  • Figure 3: Breakdown of how agents change answers between debate rounds for different agent settings. The top row denotes the first round, and the row below denotes the second round. We find that the social effect dominates: agents that can resist flipping originally correct answers in round 1 have lower resistance to the social pressure from disagreement after round 2.
  • Figure 4: The likelihood of an answer being flipped from correct to incorrect (the undesirable flip direction), plotted against the number of agents who agree with the ego agent, averaged across all rounds of debate. We find that the number of other agents who agree with the agent appears to be correlated with the frequency with which the agents flip their answers, indicating that the models may be influenced by social effects. We further observe significant variance in this answer-flipping behavior conditioned on the specific dataset or model in question.
  • Figure 5: A comparison between the types of answer flips when agents are given the base prompt or the correctness payoff prompt. We find that introducing the correctness payoff intervention does not appear to decrease the number of undesirable correct $\rightarrow$ incorrect flips -- indicating that this intervention against sycophancy is insufficient on its own to resolve the issues with multi-agent debate.
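
The flip statistics described in Figures 2 through 5 can be illustrated with a minimal Python sketch. It is not the authors' code: the function names, the data layout (a list of per-round answer lists, one answer per agent), and the choice to count agreement among peers in the previous round are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def classify_flip(prev_answer, new_answer, gold):
    """Label one agent's revision between two consecutive rounds (hypothetical helper)."""
    was_correct = prev_answer == gold
    is_correct = new_answer == gold
    if was_correct and is_correct:
        return "stay_correct"
    if was_correct:
        return "correct_to_incorrect"  # the undesirable direction
    if is_correct:
        return "incorrect_to_correct"
    return "stay_incorrect"


def flip_counts(rounds, gold):
    """Count flip types over all consecutive round pairs.

    rounds: list of rounds; each round is a list with one answer per agent
    (assumed layout, not taken from the paper).
    """
    counts = Counter()
    for prev, new in zip(rounds, rounds[1:]):
        for a_prev, a_new in zip(prev, new):
            counts[classify_flip(a_prev, a_new, gold)] += 1
    return counts


def harmful_flip_rate_by_agreement(rounds, gold):
    """Rate of correct -> incorrect flips among previously correct agents,
    grouped by how many peers agreed with the agent in the previous round."""
    totals = defaultdict(int)
    harmful = defaultdict(int)
    for prev, new in zip(rounds, rounds[1:]):
        for i, (a_prev, a_new) in enumerate(zip(prev, new)):
            if a_prev != gold:
                continue  # only previously correct answers can flip harmfully
            agreeing = sum(
                1 for j, other in enumerate(prev) if j != i and other == a_prev
            )
            totals[agreeing] += 1
            if a_new != gold:
                harmful[agreeing] += 1
    return {k: harmful[k] / totals[k] for k in totals}


# Toy debate on one question: three agents, three rounds, gold answer "A".
toy_rounds = [["A", "B", "B"], ["B", "B", "B"], ["B", "B", "A"]]
toy_counts = flip_counts(toy_rounds, "A")
toy_rates = harmful_flip_rate_by_agreement(toy_rounds, "A")
```

In the toy data, the one initially correct agent has no agreeing peers and abandons its answer after the first round, which is the kind of correct-to-incorrect flip the captions describe. Aggregating such counts over many questions would give per-setting breakdowns of the sort plotted in the figures.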