Auditing Multi-Agent LLM Reasoning Trees Outperforms Majority Vote and LLM-as-Judge

Wei Yang, Shixuan Li, Heng Ping, Peiyu Zhang, Paul Bogdan, Jesse Thomason

TL;DR

The paper tackles the vulnerability of majority voting (MV) in multi-agent LLM reasoning when agents' errors are correlated and the group converges on the same wrong answer. It introduces AgentAuditor, which builds a Reasoning Tree from agent traces, audits only at decision-critical Divergence Points using Divergence Packets, and employs Anti-Consensus Preference Optimization (ACPO) to resist sycophancy. Across multiple multi-agent system (MAS) settings and LLM backbones, AgentAuditor improves accuracy by up to 5 percentage points over MV and up to 3 points over LLM-as-Judge, while reducing token usage. This work shifts MAS aggregation from frequency-based voting to evidence-based adjudication, offering a scalable and robust approach for complex multi-agent reasoning tasks.

Abstract

Multi-agent systems (MAS) can substantially extend the reasoning capacity of large language models (LLMs), yet most frameworks still aggregate agent outputs with majority voting. This heuristic discards the evidential structure of reasoning traces and is brittle under the confabulation consensus, where agents share correlated biases and converge on the same incorrect rationale. We introduce AgentAuditor, which replaces voting with a path search over a Reasoning Tree that explicitly represents agreements and divergences among agent traces. AgentAuditor resolves conflicts by comparing reasoning branches at critical divergence points, turning global adjudication into efficient, localized verification. We further propose Anti-Consensus Preference Optimization (ACPO), which trains the adjudicator on majority-failure cases and rewards evidence-based minority selections over popular errors. AgentAuditor is agnostic to MAS setting, and we find across 5 popular settings that it yields up to 5% absolute accuracy improvement over a majority vote, and up to 3% over using LLM-as-Judge.
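The idea of representing agreements and divergences among agent traces can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each trace is already split into steps, and it merges steps by exact match after whitespace and case normalization, where the paper describes semantic deduplication. The names `Node`, `build_tree`, and `divergence_points` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One distinct reasoning step; `agents` lists the agents whose trace passes through it."""
    step: str
    children: dict = field(default_factory=dict)
    agents: list = field(default_factory=list)


def build_tree(traces):
    """Merge per-agent step lists into a prefix tree (assumption: exact-match merging)."""
    root = Node("<root>")
    for agent_id, steps in enumerate(traces):
        node = root
        node.agents.append(agent_id)
        for step in steps:
            key = " ".join(step.lower().split())  # crude stand-in for semantic matching
            node = node.children.setdefault(key, Node(step))
            node.agents.append(agent_id)
    return root


def divergence_points(node, path=()):
    """Return (shared prefix, competing next steps) for every node with more than one child."""
    found = []
    if len(node.children) > 1:
        found.append((path, sorted(node.children)))
    for key, child in node.children.items():
        found.extend(divergence_points(child, path + (key,)))
    return found


traces = [
    ["a = 2 + 3", "a = 5", "answer = 10"],
    ["a = 2 + 3", "a = 5", "answer = 10"],
    ["a = 2 + 3", "a = 6", "answer = 12"],
]
root = build_tree(traces)
points = divergence_points(root)
```

In this toy slate the three traces share the first step and split at the second, so an auditor would only need to compare the two competing branches at that single point instead of re-reading every full trace. Identical traces also collapse into one branch, so a branch's support is recorded (in `agents`) without giving it more than one slot in the comparison.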

Paper Structure (71 sections, 2 theorems, 21 equations, 5 figures, 5 tables)

Key Result

Proposition 3.1

If $\rho=0$ and $p>1/2$, then $\mathrm{Var}(\bar{X})=O(1/N)$ and $\bar{X}$ concentrates around $p$, recovering the classical CJT intuition. If $\rho>0$ is bounded away from $0$, then $\mathrm{Var}(\bar{X})$ does not vanish as $N\to\infty$ (it approaches $p(1-p)\rho$), so increasing the number of age
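The two regimes in the proposition can be checked numerically with a small sketch. It assumes an equicorrelated model that the excerpt does not spell out: $N$ Bernoulli($p$) votes with the same pairwise correlation $\rho$, for which $\mathrm{Var}(\bar{X}) = p(1-p)\,\frac{1+(N-1)\rho}{N}$. The function name `vote_mean_variance` is hypothetical.

```python
def vote_mean_variance(p: float, rho: float, n: int) -> float:
    """Variance of the mean of n Bernoulli(p) votes sharing pairwise correlation rho.

    Assumption: every pair of votes has the same correlation rho.
    """
    if n < 1:
        raise ValueError("n must be a positive integer")
    return p * (1.0 - p) * (1.0 + (n - 1) * rho) / n


# Independent votes shrink like 1/N; correlated votes level off near p(1-p)*rho.
for n in (5, 50, 5000):
    print(n, vote_mean_variance(0.6, 0.0, n), vote_mean_variance(0.6, 0.3, n))
```

With $\rho = 0$ the expression reduces to $p(1-p)/N$, and for fixed $\rho > 0$ it tends to $p(1-p)\rho$ as $N$ grows, matching the limit stated in the proposition.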

Figures (5)

  • Figure 1: Majority voting vs. AgentAuditor. Left: Majority voting can follow the herd into a dominant but wrong consensus. Right: AgentAuditor audits localized branch evidence on a reasoning tree to reliably select the correct minority answer. This contrasts frequency-based selection with evidence-based adjudication under confabulation consensus.
  • Figure 2: Overall architecture of the AgentAuditor framework. Given a multi-agent slate of reasoning traces, AgentAuditor performs structural semantic deduplication to construct a compact Reasoning Tree of distinct hypotheses. It then audits only decision-critical Divergence Points by comparing localized branch evidence, selecting the winning hypothesis and propagating its answer as the final aggregation. For learnable auditing, the Auditor is trained with Anti-Consensus Preference Optimization on consensus-trap instances.
  • Figure 3: Key module ablations for AgentAuditor. Removing the conditional beam hurts performance, while LLM-based splitting and embeddings yield only minor changes.
  • Figure 4: Case Study. Majority Voting fails under confabulation consensus, while AgentAuditor prunes decision-critical divergences by flagging a fatal unit mismatch (mixing cheese/pepperoni across different pizza sizes) and a later constraint violation (spurious "Kate"), thereby isolating the correct solution.
  • Figure 5: Case Study. An example where majority voting converges to an incorrect consensus due to correlated fluent errors. AgentAuditor audits only decision-critical divergence points, detects a flavor-unit mismatch and an unsupported population assumption, prunes the invalid branches early, and selects the correct per-flavor conversion path.

Theorems & Definitions (2)

  • Proposition 3.1: Failure of independence assumption
  • Proposition 3.2: Deduplication removes the "quantity advantage"