The impact of multi-agent debate protocols on debate quality: a controlled case study

Ramtin Zargari Marandi

Abstract

Multi-agent debate (MAD) systems are often reported to yield performance gains; however, because the debate protocol (e.g., number of agents, rounds, and aggregation rule) is typically held fixed while model-related factors vary, it is difficult to disentangle protocol effects from model effects. To isolate protocol effects, we compare three main protocols against a No-Interaction baseline (NI; independent responses without peer visibility): Within-Round (WR; agents see only current-round contributions), Cross-Round (CR; full prior-round context), and a novel Rank-Adaptive Cross-Round protocol (RA-CR; dynamically reorders agents and silences one per round via an external judge model). In a controlled macroeconomic case study (20 diverse events, five random seeds, matched prompts and decoding settings), RA-CR converges faster than CR, WR shows higher peer-referencing, and NI maximizes Argument Diversity, which is unaffected across the main protocols. These results reveal a trade-off between interaction (peer-referencing rate) and convergence (consensus formation), confirming that protocol design matters. When consensus is prioritized, RA-CR outperforms the other protocols.
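
The visibility rules that distinguish the four conditions can be illustrated with a minimal Python sketch. It is only one possible reading of the one-line protocol descriptions in the abstract, not the paper's implementation: the `Message` record, the function names, the choice to give RA-CR the same visibility as CR, and the rule that RA-CR silences the lowest-scored agent are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical message record: (round_index, agent_name, text).
Message = Tuple[int, str, str]


def visible_context(protocol: str, transcript: List[Message],
                    current_round: int) -> List[Message]:
    """Return the peer messages an agent may read before it speaks."""
    if protocol == "NI":
        # No-Interaction baseline: no peer visibility at all.
        return []
    if protocol == "WR":
        # Within-Round: only contributions from the current round.
        return [m for m in transcript if m[0] == current_round]
    if protocol in ("CR", "RA-CR"):
        # Assumption: "full prior-round context" means every message
        # from earlier rounds; current-round turns are not included here.
        return [m for m in transcript if m[0] < current_round]
    raise ValueError(f"unknown protocol: {protocol}")


def next_round_order(judge_scores: Dict[str, float]) -> List[str]:
    """RA-CR sketch: reorder agents by judge score and silence one.

    Assumption: the judge returns one score per agent, higher is better,
    and the lowest-scored agent is the one silenced. The abstract only
    says that agents are reordered and one is silenced per round.
    """
    ranked = sorted(judge_scores, key=judge_scores.get, reverse=True)
    return ranked[:-1]  # drop the last-ranked agent for the next round
```

Under these assumptions, an agent in round 2 of WR would see only round-2 turns made so far, the same agent under CR would see all of round 1, and under NI it would see nothing, which is why the peer-referencing rate is structurally zero for NI.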

Paper Structure

This paper contains 20 sections, 3 equations, 3 figures, 5 tables, 2 algorithms.

Figures (3)

  • Figure 1: Controlled workflow for dataset construction, matched protocol execution, metric computation, and paired inference. Protocol is the manipulated factor, while prompts, decoding settings, model assignment, and judge-guided candidate selection are held fixed across conditions.
  • Figure 2: Protocol comparison under a shared two-round timeline for WR, CR, RA-CR, and NI.
  • Figure 3: Aggregate comparison of the three primary debate protocols plus the No-Interaction baseline from the SBERT top-20, five-seed aggregate (100 matched units per condition): (a) PRR, (b) AD, and (c) CF. Bars show condition means with 95% bootstrap confidence intervals. NI is omitted from panel (a) because PRR is structurally zero under no peer visibility. The figure highlights the central trade-off of the study: WR is stronger on interaction-oriented metrics, whereas RA-CR is stronger on convergence-oriented metrics. Significance markers: $*$ $p<0.05$, $**$ $p<0.01$, $***$ $p<0.001$ after Holm-Bonferroni adjustment.
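
The two statistical ingredients named in the Figure 3 caption, bootstrap confidence intervals and Holm-Bonferroni adjustment, can be sketched in plain Python. This is a generic illustration rather than the paper's analysis code: the caption does not say which bootstrap variant was used, so the percentile bootstrap of the mean below is an assumption, and the function names, resample count, and seed are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def bootstrap_ci_mean(values: Sequence[float], n_resamples: int = 2000,
                      alpha: float = 0.05, seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap confidence interval for the mean (assumed variant)."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(sum(rng.choices(values, k=n)) / n for _ in range(n_resamples))
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = int((1 - alpha / 2) * n_resamples) - 1
    return means[lo_idx], means[hi_idx]


def holm_adjust(p_values: Sequence[float]) -> List[float]:
    """Holm-Bonferroni step-down adjusted p-values, in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (rank = k - 1) is multiplied by (m - k + 1).
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity so adjusted values never decrease with rank.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


def stars(p_adjusted: float) -> str:
    """Map an adjusted p-value to the significance markers used in the caption."""
    if p_adjusted < 0.001:
        return "***"
    if p_adjusted < 0.01:
        return "**"
    if p_adjusted < 0.05:
        return "*"
    return ""
```

For example, `holm_adjust([0.01, 0.04, 0.03])` multiplies the smallest p-value by 3 and the next by 2, then carries the running maximum forward, so the last two comparisons share the same adjusted value. In practice an established library routine would usually be preferable to hand-rolled code; the sketch is only meant to make the procedure concrete.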