MultiAgentBench: Evaluating the Collaboration and Competition of LLM agents

Kunlun Zhu; Hongyi Du; Zhaochen Hong; Xiaocheng Yang; Shuyi Guo; Zhe Wang; Zhenhailong Wang; Cheng Qian; Xiangru Tang; Heng Ji; Jiaxuan You

TL;DR

MultiAgentBench addresses the gap in evaluating LLM-driven multi-agent systems by introducing MARBLE, a modular framework that combines an Agent Graph, a Cognitive Module, and a Coordination Engine to support diverse topologies and planning strategies. The benchmark uses milestone-based KPIs to measure not only task completion but also the quality of coordination and competition across six interactive scenarios, covering both mutual-goal and conflicting-goal tasks. In the experiments, gpt-4o-mini achieves the highest average task score, the graph topology performs best among coordination protocols in the research scenario, and cognitive planning improves milestone achievement rates by 3%; the experiments also reveal emergent social behaviors. The work identifies directions for future work, including broader scenario coverage, memory architectures, and richer competition mechanisms.

Abstract

Large Language Models (LLMs) have shown remarkable capabilities as autonomous agents, yet existing benchmarks either focus on single-agent tasks or are confined to narrow domains, failing to capture the dynamics of multi-agent coordination and competition. In this paper, we introduce MultiAgentBench, a comprehensive benchmark designed to evaluate LLM-based multi-agent systems across diverse, interactive scenarios. Our framework measures not only task completion but also the quality of collaboration and competition using novel, milestone-based key performance indicators. Moreover, we evaluate various coordination protocols (including star, chain, tree, and graph topologies) and innovative strategies such as group discussion and cognitive planning. Notably, gpt-4o-mini achieves the highest average task score, the graph structure performs best among coordination protocols in the research scenario, and cognitive planning improves milestone achievement rates by 3%. Code and datasets are publicly available at https://github.com/MultiagentBench/MARBLE.
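
To make the four coordination topologies named in the abstract concrete, the sketch below builds each one as a set of communication links between agents. It is an illustration only and is not taken from the MARBLE codebase: the function names, the agent labels, the binary branching factor for the tree, and the reading of "graph" as a fully connected mesh are all assumptions.

```python
from itertools import combinations

# Hypothetical helpers (not MARBLE's API): each returns a set of
# (sender, receiver) links describing who may talk to whom.

def star(agents):
    """Hub-and-spoke: the first agent is linked to every other agent."""
    hub, *others = agents
    return {(hub, a) for a in others}

def chain(agents):
    """Each agent is linked only to the next agent in the list."""
    return set(zip(agents, agents[1:]))

def tree(agents, branching=2):
    """Agent i reports to agent (i - 1) // branching (assumed binary tree)."""
    return {(agents[(i - 1) // branching], agents[i])
            for i in range(1, len(agents))}

def graph(agents):
    """Assumed fully connected mesh: every pair of agents is linked."""
    return set(combinations(agents, 2))

agents = ["a0", "a1", "a2", "a3"]
for name, build in [("star", star), ("chain", chain),
                    ("tree", tree), ("graph", graph)]:
    print(name, sorted(build(agents)))
```

With four agents, the star, chain, and tree each have three links, while the fully connected mesh has six, which is one way to see why a graph topology allows more direct coordination at the cost of more communication.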
