Should we be going MAD? A Look at Multi-Agent Debate Strategies for LLMs

Andries Smit, Paul Duckworth, Nathan Grinsztajn, Thomas D. Barrett, Arnu Pretorius

TL;DR

The paper benchmarks multi-agent debate (MAD) strategies for large language models across medical and reasoning datasets, comparing them to self-consistency and other ensembling methods. Using GPT-3.5-turbo and a unified evaluation suite, it shows that MAD does not reliably outperform non-debate prompting strategies. With careful hyperparameter tuning, especially of agreement intensity, MAD becomes competitive and can surpass some baselines on certain tasks. MAD performance is highly dataset- and configuration-dependent, and the authors provide an open-source toolkit for probing these strategies further. Overall, the results suggest that MAD is not inherently worse than other approaches but is more sensitive to hyperparameter settings and harder to optimize, with practical implications for cost and reliability in real-world QA tasks.

Abstract

Recent advancements in large language models (LLMs) underscore their potential for responding to inquiries in various domains. However, ensuring that generative agents provide accurate and reliable answers remains an ongoing challenge. In this context, multi-agent debate (MAD) has emerged as a promising strategy for enhancing the truthfulness of LLMs. We benchmark a range of debating and prompting strategies to explore the trade-offs between cost, time, and accuracy. Importantly, we find that multi-agent debating systems, in their current form, do not reliably outperform other proposed prompting strategies, such as self-consistency and ensembling using multiple reasoning paths. However, when performing hyperparameter tuning, several MAD systems, such as Multi-Persona, perform better. This suggests that MAD protocols might not be inherently worse than other approaches, but that they are more sensitive to different hyperparameter settings and difficult to optimize. We build on these results to offer insights into improving debating strategies, such as adjusting agent agreement levels, which can significantly enhance performance and even surpass all other non-debate protocols we evaluated. We provide an open-source repository to the community with several state-of-the-art protocols together with evaluation scripts to benchmark across popular research datasets.
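
As a point of reference for the non-debate baselines named above, self-consistency is commonly described as sampling several independent reasoning paths and taking a majority vote over their final answers. The snippet below is a minimal sketch of that voting step only. The function name `majority_vote` and the sample answers are hypothetical, chosen for illustration; they are not taken from the authors' repository, and no model is called.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer and the share of votes it received.

    `answers` is a list of final answers (e.g. multiple-choice letters),
    one per independently sampled reasoning path.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(answers)
    # most_common(1) returns [(answer, count)]; ties resolve to the
    # answer that was seen first.
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(answers)


# Hypothetical answers from five sampled reasoning paths.
sampled = ["B", "B", "C", "B", "A"]
answer, share = majority_vote(sampled)
print(answer, share)
```

The vote share gives a rough sense of how much the sampled paths agree. Each extra sample costs one more model call, which is the cost versus accuracy trade-off the benchmark measures.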

Paper Structure (20 sections, 16 figures, 2 tables)

Figures (16)

  • Figure 1: Benchmark of experiment configurations on the MedQA dataset. Left: Accuracy vs average cost ($) per question. The size of the dots reflects the average number of API calls required per question. Right: Summarizes accuracy grouped by strategy, sorted by average performance (black dot). The X represents improved performance using our proposed agreement modulation, described in the section on agreement modulation.
  • Figure 2: Boxplots illustrating the distribution of total accuracy across different datasets for each QA system. These plots demonstrate the variability and robustness of each system's performance.
  • Figure 3: Comparing medical vs non-medical datasets. The red line indicates the average performance of single-agent systems. Left: Accuracy averaged over the three medical datasets. Right: Accuracy averaged over the three non-medical datasets.
  • Figure 4: Relative accuracy improvements, averaged over all datasets, between the answers given by the first agent during the first and last rounds, as well as with the final answer.
  • Figure 5: Left: Multi-Persona's accuracy according to the agreement fixed by the prompt. Right: Multi-Persona's accuracy according to the averaged agreement during the first debate round.
  • ...and 11 more figures