MultiAgentBench: Evaluating the Collaboration and Competition of LLM agents

Kunlun Zhu, Hongyi Du, Zhaochen Hong, Xiaocheng Yang, Shuyi Guo, Zhe Wang, Zhenhailong Wang, Cheng Qian, Xiangru Tang, Heng Ji, Jiaxuan You

TL;DR

MultiAgentBench addresses the gap in evaluating LLM-driven multi-agent systems by introducing MARBLE, a modular framework that combines an Agent Graph, a Cognitive Module, and a Coordination Engine to support diverse topologies and planning strategies. The benchmark uses milestone-based KPIs to measure not only task completion but also the quality of coordination and competition across six interactive scenarios, including mutual-goal and conflicting-goal tasks. In the experiments, gpt-4o-mini attains the highest average task score, graph-based coordination performs best among the protocols in the research scenario, and cognitive planning improves milestone achievement rates by 3%; the interactions also reveal emergent social behaviors. The work highlights avenues for expanding scenario coverage, memory architectures, and richer competition mechanisms, contributing to progress toward scalable, collaborative AI.

Abstract

Large Language Models (LLMs) have shown remarkable capabilities as autonomous agents, yet existing benchmarks either focus on single-agent tasks or are confined to narrow domains, failing to capture the dynamics of multi-agent coordination and competition. In this paper, we introduce MultiAgentBench, a comprehensive benchmark designed to evaluate LLM-based multi-agent systems across diverse, interactive scenarios. Our framework measures not only task completion but also the quality of collaboration and competition using novel, milestone-based key performance indicators. Moreover, we evaluate various coordination protocols (including star, chain, tree, and graph topologies) and innovative strategies such as group discussion and cognitive planning. Notably, gpt-4o-mini reaches the highest average task score, the graph structure performs best among coordination protocols in the research scenario, and cognitive planning improves milestone achievement rates by 3%. Code and datasets are publicly available at https://github.com/MultiagentBench/MARBLE.
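
The four coordination topologies named in the abstract can be pictured as different wiring patterns between agents. The following is a minimal sketch, not MARBLE's implementation: the function name `build_topology` is hypothetical, the tree is assumed to be a binary tree in heap order, and the "graph" case is assumed to be fully connected, whereas the paper's graph protocol may use a different edge set.

```python
from typing import Dict, List, Set


def build_topology(kind: str, agents: List[str]) -> Dict[str, List[str]]:
    """Return an undirected adjacency list for a named topology.

    Hypothetical helper for illustration only; it is not part of MARBLE.
    """
    if len(agents) < 2:
        raise ValueError("need at least two agents")

    adjacency: Dict[str, Set[str]] = {agent: set() for agent in agents}

    def link(a: str, b: str) -> None:
        # Communication links are treated as bidirectional.
        adjacency[a].add(b)
        adjacency[b].add(a)

    if kind == "star":
        # One central agent talks to every other agent.
        for other in agents[1:]:
            link(agents[0], other)
    elif kind == "chain":
        # Each agent talks only to its predecessor and successor.
        for left, right in zip(agents, agents[1:]):
            link(left, right)
    elif kind == "tree":
        # Assumption: binary tree in heap order, parent of i is (i - 1) // 2.
        for i in range(1, len(agents)):
            link(agents[(i - 1) // 2], agents[i])
    elif kind == "graph":
        # Assumption: every pair of agents is connected.
        for i, a in enumerate(agents):
            for b in agents[i + 1:]:
                link(a, b)
    else:
        raise ValueError(f"unknown topology: {kind}")

    return {agent: sorted(peers) for agent, peers in adjacency.items()}


if __name__ == "__main__":
    for kind in ("star", "chain", "tree", "graph"):
        print(kind, build_topology(kind, ["a", "b", "c", "d"]))
```

Under these assumptions, star, chain, and tree all use the same number of links for a given team size and differ only in who can reach whom directly, while the fully connected graph adds a link for every pair of agents.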

Paper Structure

This paper contains 94 sections, 5 equations, 27 figures, 8 tables.

Figures (27)

  • Figure 1: Overview of the MultiAgentBench evaluation process: multi-agent system coordination in various interactive environments, with a focus on task performance and coordination.
  • Figure 2: MARBLE: interactions between task information, persona data, domain databases, memory modules, and the environment through the coordination engine and cognitive module.
  • Figure 3: Illustration of coordination protocols and planning prompt strategies. (a) shows centralized and decentralized planning structures (e.g., star, tree, graph, and chain). (b) describes strategies like group discussions and cognitive prompts, incorporating iterative feedback and task updates for effective planning.
  • Figure 4: Illustration of our benchmark curation and the dynamic milestone detection for the KPI metric.
  • Figure 5: Comparison of different coordination protocols (tree, star, graph, and chain) across multiple evaluation metrics. Specifically, token usage is scaled so that the lowest value is $0$ and the highest value is $100$. Details about the metrics used for the research task are given in the paper's section on research task metrics.
  • ...and 22 more figures
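
The token-usage scaling mentioned in the Figure 5 caption (lowest value mapped to 0, highest to 100) matches ordinary min-max normalization. The caption does not give the formula, so the sketch below assumes a linear rescaling; the function name `min_max_scale` and the sample token counts are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def min_max_scale(values: Sequence[float], lo: float = 0.0, hi: float = 100.0) -> List[float]:
    """Linearly rescale values so the minimum maps to lo and the maximum to hi."""
    if not values:
        raise ValueError("values must be non-empty")
    v_min, v_max = min(values), max(values)
    if v_max == v_min:
        # All values are equal, so there is no spread to rescale.
        return [lo for _ in values]
    span = v_max - v_min
    return [lo + (v - v_min) * (hi - lo) / span for v in values]


if __name__ == "__main__":
    # Hypothetical token counts for four protocols, for illustration only.
    token_usage = [200, 300, 400, 250]
    print(min_max_scale(token_usage))
```

Scaling this way puts token usage on the same 0 to 100 axis as the other metrics, but it shows only relative cost among the compared protocols, not absolute token counts.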