Beyond the Strongest LLM: Multi-Turn Multi-Agent Orchestration vs. Single LLMs on Benchmarks

Aaron Xuxiang Tian, Ruofan Zhang, Jiayao Tang, Young Min Cho, Xueqian Li, Qiang Yi, Ji Wang, Zhunping Zhang, Danrui Qi, Zekun Li, Xingyu Xiang, Sharath Chandra Guntuku, Lyle Ungar, Tianyu Shi, Chi Wang

TL;DR

This work investigates whether multi-turn, multi-agent orchestration of heterogeneous LLMs can outperform strong single-LLM baselines on three benchmarks (GPQA-Diamond, IFEval, MuSR). It introduces a structured coordination framework with three phases (Agent Action, Consensus, Final Presentation) and a dynamic restart mechanism that integrates new candidate answers while preventing premature consensus. Across two experiments using four LLMs, orchestration consistently matches or surpasses the strongest individual model. Coordination choices, such as authorship disclosure and visible vote tallies, shape turn-level behavior and consensus, and headroom remains for efficiency gains. The findings highlight the potential of coordinated multi-agent systems for robust reasoning and can guide future framework design toward closing the gap to best-possible performance.

Abstract

We study multi-turn multi-agent orchestration, where multiple large language model (LLM) agents interact over multiple turns by iteratively proposing answers or casting votes until reaching consensus. Using four LLMs (Gemini 2.5 Pro, GPT-5, Grok 4, and Claude Sonnet 4) on GPQA-Diamond, IFEval, and MuSR, we conduct two experiments: (i) benchmarking orchestration against single-LLM baselines; and (ii) ablations on GPQA-Diamond that vary whether agents see who authored answers and whether they can observe ongoing votes. Orchestration matches or exceeds the strongest single model and consistently outperforms the others. Analysis of best-achievable orchestration performance shows potential for further gains. The ablations show that revealing authorship increases self-voting and ties, and that showing ongoing votes amplifies herding, which speeds convergence but can sometimes yield premature consensus.

Paper Structure

This paper contains 41 sections, 10 equations, 1 figure, 3 tables.

Figures (1)

  • Figure 1: Effect of coordination strategies on GPQA-Diamond. Bars show percentages under three settings: Default (Anonymous + Hidden Tally), Identified Voting, and Visible Tally. Left: Self-voting Rate, the percentage of votes an agent cast for its own answer. Middle: First-voted Selected Rate, the percentage of tasks where the answer that received the first vote became the final consensus. In the first two plots, the rightmost group "All agents" aggregates across agents. Right: Consensus Tie Rate, the percentage of tasks with no majority. The hatched bar ("$\geq$2 Self-voters") marks the subset of tie cases where at least two agents voted for themselves. Models: Gemini 2.5 Pro, GPT-5, Grok 4, and Claude Sonnet 4.
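
The three metrics in the Figure 1 caption can be made concrete with a small sketch. This is an illustration only, not the authors' evaluation code: the `TaskLog` record, the agent labels, and the example votes are hypothetical. Two readings are also assumptions: "no majority" is taken to mean that two or more answers share the top vote count, and the hatched subset is reported as a percentage of all tasks.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class TaskLog:
    """Hypothetical per-task record; not a format defined by the paper."""
    # (voter, author of the answer voted for), in chronological order
    votes: List[Tuple[str, str]]
    # author of the answer finally selected, or None if nothing was selected
    selected: Optional[str]


def self_voting_rate(logs: List[TaskLog]) -> float:
    """Percentage of all votes that an agent cast for its own answer."""
    votes = [v for log in logs for v in log.votes]
    if not votes:
        return 0.0
    return 100.0 * sum(voter == target for voter, target in votes) / len(votes)


def first_voted_selected_rate(logs: List[TaskLog]) -> float:
    """Percentage of tasks where the first-voted answer became the final one."""
    if not logs:
        return 0.0
    hits = sum(bool(log.votes) and log.votes[0][1] == log.selected for log in logs)
    return 100.0 * hits / len(logs)


def is_tie(log: TaskLog) -> bool:
    """Assumed reading of 'no majority': the top vote count is shared."""
    top = Counter(target for _, target in log.votes).most_common(2)
    return len(top) == 2 and top[0][1] == top[1][1]


def tie_rate(logs: List[TaskLog], min_self_voters: int = 0) -> float:
    """Percentage of tasks that tie, optionally requiring N self-voters."""
    if not logs:
        return 0.0
    ties = sum(
        is_tie(log)
        and sum(voter == target for voter, target in log.votes) >= min_self_voters
        for log in logs
    )
    return 100.0 * ties / len(logs)


# Made-up votes for four agents A-D on four tasks (not data from the paper).
example_logs = [
    TaskLog([("A", "A"), ("B", "A"), ("C", "A"), ("D", "B")], "A"),
    TaskLog([("A", "A"), ("B", "B"), ("C", "A"), ("D", "B")], "B"),
    TaskLog([("A", "C"), ("B", "C"), ("C", "D"), ("D", "D")], "C"),
    TaskLog([("A", "B"), ("B", "B"), ("C", "B"), ("D", "B")], "B"),
]
```

On these made-up logs, `self_voting_rate(example_logs)` gives 31.25, `first_voted_selected_rate(example_logs)` gives 75.0, `tie_rate(example_logs)` gives 50.0, and `tie_rate(example_logs, min_self_voters=2)` gives 25.0.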