BattleAgentBench: A Benchmark for Evaluating Cooperation and Competition Capabilities of Language Models in Multi-Agent Systems

Wei Wang, Dan Zhang, Tao Feng, Boyan Wang, Jie Tang

TL;DR

BattleAgentBench is a fine-grained, multi-level benchmark for evaluating LLMs under cooperative and competitive multi-agent dynamics in a turn-based game environment. It defines seven stages across three difficulty levels and uses metrics such as Forward Distance, Format Accuracy, Move Accuracy, and Score to compare API-based and open-source models in single-agent, paired-agent, and multi-agent settings. The study finds that API-based models generally outperform open-source models on simple tasks, but that even API-based models leave substantial room for improvement on complex cooperative/competitive tasks, with a few models (e.g., Claude-based, GPT-4o variants) showing stronger coordination skills. The work provides a scalable evaluation framework, ablation analyses, and case studies that illuminate how collaboration affects outcomes, offering a foundation for advancing LLM capabilities in coordinated multi-agent systems.

Abstract

Large Language Models (LLMs) are becoming increasingly powerful and capable of handling complex tasks, e.g., building single agents and multi-agent systems. Compared to single agents, multi-agent systems place higher demands on the collaboration capabilities of language models. Many benchmarks have been proposed to evaluate these collaborative abilities. However, these benchmarks lack fine-grained evaluations of LLM collaborative capabilities. Additionally, existing work ignores scenarios in which multiple agents both collaborate and compete. To address these two problems, we propose a benchmark, called BattleAgentBench, which defines seven sub-stages across three difficulty levels and conducts a fine-grained evaluation of language models in terms of single-agent scenario navigation capabilities, paired-agent task execution abilities, and multi-agent collaboration and competition capabilities. We conducted extensive evaluations on four leading closed-source and seven open-source models. Experimental results indicate that API-based models perform excellently on simple tasks, whereas small open-source models struggle even with simple tasks. On difficult tasks that require collaborative and competitive abilities, API-based models demonstrate some collaborative capabilities, but there is still enormous room for improvement.

Paper Structure (18 sections, 4 equations, 6 figures, 6 tables)

This paper contains 18 sections, 4 equations, 6 figures, 6 tables.

Figures (6)

  • Figure 1: Overall evaluation framework of the BattleAgentBench.
  • Figure 2: Level 1: Stage 1 and Stage 2. The agent's goal in both stages is to reach the base location.
  • Figure 3: Stages of Level 2 (double-agent level). In Stage 3 and Stage 4, the two agents have a cooperative relationship and a competitive relationship respectively.
  • Figure 4: Stages of Level 3 (multi-agent level). In Stage 5, agents within a team cooperate, while agents on different teams compete. In Stage 6, agents on different teams compete, although cooperation between teams is allowed. In Stage 7, agents within a team cooperate, agents on different teams compete, and cooperation between teams is also allowed.
  • Figure 5: Goal completion rate of different models.
  • ...and 1 more figure
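
The stage structure described in the captions of Figures 2 to 4 can be summarized as a small lookup table. The sketch below is illustrative only: the names `Relation`, `StageConfig`, `STAGES`, and `stages_involving` are hypothetical and do not come from the paper or its code, and the encoding makes two assumptions. First, the cooperating pair in Stage 3 is treated as one team and the competing pair in Stage 4 as two one-agent teams. Second, `Relation.NONE` is used wherever a caption states no relationship, as for the single-agent stages and the within-team relationship in Stage 6.

```python
from dataclasses import dataclass
from enum import Enum


class Relation(Enum):
    NONE = "none"  # no relationship stated in the figure caption
    COOPERATIVE = "cooperative"
    COMPETITIVE = "competitive"


@dataclass(frozen=True)
class StageConfig:
    """Hypothetical encoding of one stage, derived from the figure captions."""
    stage: int
    level: int
    within_team: Relation          # relationship among teammates
    between_teams: Relation        # relationship across teams or opponents
    cross_team_cooperation: bool   # is cooperation between teams allowed?


C, X, N = Relation.COOPERATIVE, Relation.COMPETITIVE, Relation.NONE

STAGES = {
    1: StageConfig(1, 1, N, N, False),  # single agent, goal: reach the base
    2: StageConfig(2, 1, N, N, False),  # single agent, goal: reach the base
    3: StageConfig(3, 2, C, N, False),  # two cooperating agents (assumed: one team)
    4: StageConfig(4, 2, N, X, False),  # two competing agents (assumed: two teams)
    5: StageConfig(5, 3, C, X, False),
    6: StageConfig(6, 3, N, X, True),
    7: StageConfig(7, 3, C, X, True),
}


def stages_involving(relation: Relation) -> list[int]:
    """Return the stage numbers in which the given relationship can occur."""
    result = []
    for number, cfg in sorted(STAGES.items()):
        cooperative = (
            cfg.within_team is Relation.COOPERATIVE or cfg.cross_team_cooperation
        )
        competitive = cfg.between_teams is Relation.COMPETITIVE
        if relation is Relation.COOPERATIVE and cooperative:
            result.append(number)
        elif relation is Relation.COMPETITIVE and competitive:
            result.append(number)
    return result


if __name__ == "__main__":
    print("cooperation:", stages_involving(Relation.COOPERATIVE))
    print("competition:", stages_involving(Relation.COMPETITIVE))
```

Under this encoding, only the three Level 3 stages combine cooperation and competition in the same episode, which matches the benchmark's framing of Level 3 as the multi-agent collaboration and competition level.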