Beyond the Strongest LLM: Multi-Turn Multi-Agent Orchestration vs. Single LLMs on Benchmarks
Aaron Xuxiang Tian, Ruofan Zhang, Jiayao Tang, Young Min Cho, Xueqian Li, Qiang Yi, Ji Wang, Zhunping Zhang, Danrui Qi, Zekun Li, Xingyu Xiang, Sharath Chandra Guntuku, Lyle Ungar, Tianyu Shi, Chi Wang
TL;DR
This work investigates whether multi-turn, multi-agent orchestration of heterogeneous LLMs can outperform strong single-LLM baselines on three benchmarks (GPQA-Diamond, IFEval, MuSR). It introduces a structured coordination framework with three phases (Agent Action, Consensus, Final Presentation) and a dynamic restart mechanism that integrates new candidate answers while preventing premature consensus. Across two experiments using four LLMs, orchestration consistently matches or surpasses the strongest individual model. Coordination choices, such as authorship disclosure and visible vote tallies, shape turn-level behavior and consensus, and headroom remains for efficiency gains. The findings highlight the potential of coordinated multi-agent systems for robust reasoning and can guide future framework design toward closing the gap to best-possible performance.
Abstract
We study multi-turn multi-agent orchestration, where multiple large language model (LLM) agents interact over multiple turns by iteratively proposing answers or casting votes until reaching consensus. Using four LLMs (Gemini 2.5 Pro, GPT-5, Grok 4, and Claude Sonnet 4) on GPQA-Diamond, IFEval, and MuSR, we conduct two experiments: (i) benchmarking orchestration against single-LLM baselines; and (ii) ablations on GPQA-Diamond that vary whether agents see who authored answers and whether they can observe ongoing votes. Orchestration matches or exceeds the strongest single model and consistently outperforms the others. Analysis of best-achievable orchestration performance shows potential for further gains. The ablations show that revealing authorship increases self-voting and ties, and that showing ongoing votes amplifies herding, which speeds convergence but can sometimes yield premature consensus.
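The propose-or-vote loop described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `orchestrate`, the action tuples, the plurality rule, and the toy agents are all hypothetical, and the sketch assumes that any new answer discards the votes cast so far (a simple stand-in for the restart mechanism) and that consensus is declared once every agent holds a vote. The `show_votes` flag loosely mirrors the vote-visibility ablation.

```python
from collections import Counter
from typing import Callable, Dict, List, Optional, Tuple, Union

# Hypothetical action format: ("answer", text) or ("vote", candidate_index).
Action = Tuple[str, Union[str, int]]
# A hypothetical agent sees the candidate answers and (optionally) the votes.
Agent = Callable[[List[str], Dict[str, int]], Action]


def orchestrate(
    agents: Dict[str, Agent],
    max_turns: int = 10,
    show_votes: bool = True,
) -> Tuple[Optional[str], int]:
    """Run propose-or-vote turns until every agent holds a vote.

    Returns (winning answer or None, number of turns used).
    """
    candidates: List[str] = []
    votes: Dict[str, int] = {}
    for turn in range(1, max_turns + 1):
        for name, act in agents.items():
            visible = dict(votes) if show_votes else {}
            kind, value = act(list(candidates), visible)
            if kind == "answer":
                # Assumed restart rule: a new candidate invalidates all votes.
                candidates.append(str(value))
                votes.clear()
            elif (
                kind == "vote"
                and isinstance(value, int)
                and 0 <= value < len(candidates)
            ):
                votes[name] = value
        if candidates and len(votes) == len(agents):
            # Assumed consensus rule: plurality; ties go to the first vote seen.
            winner, _ = Counter(votes.values()).most_common(1)[0]
            return candidates[winner], turn
    return None, max_turns


# Toy agents (hypothetical stand-ins for LLM calls).
def agent_a(candidates: List[str], votes: Dict[str, int]) -> Action:
    if not candidates:
        return ("answer", "A1")
    return ("vote", len(candidates) - 1)


def agent_b(candidates: List[str], votes: Dict[str, int]) -> Action:
    if len(candidates) < 2:
        return ("answer", "B1")
    return ("vote", len(candidates) - 1)


def agent_c(candidates: List[str], votes: Dict[str, int]) -> Action:
    if not candidates:
        return ("answer", "C1")
    return ("vote", len(candidates) - 1)


if __name__ == "__main__":
    print(orchestrate({"a": agent_a, "b": agent_b, "c": agent_c}))
```

In this toy run, the first turn produces two candidates and a single vote, so no consensus is possible. In the second turn all three agents vote for the latest candidate and the loop ends. Real agents would replace the toy functions with model calls, and the consensus and restart rules would follow the framework's actual definitions.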
