BattleAgentBench: A Benchmark for Evaluating Cooperation and Competition Capabilities of Language Models in Multi-Agent Systems
Wei Wang, Dan Zhang, Tao Feng, Boyan Wang, Jie Tang
TL;DR
BattleAgentBench is a fine-grained, multi-level benchmark for evaluating LLMs under cooperative and competitive multi-agent dynamics in a turn-based game environment. It defines seven stages across three difficulty levels and uses metrics such as Forward Distance, Format Accuracy, Move Accuracy, and Score to compare API-based and open-source models in single-agent, paired-agent, and multi-agent settings. The study finds that API-based models generally outperform open-source models on simple tasks, but that substantial room for improvement remains on complex cooperative and competitive tasks, with a few models (e.g., Claude-based and GPT-4o variants) showing stronger coordination skills. The work provides a scalable evaluation framework, ablation analyses, and case studies that illuminate how collaboration affects outcomes, offering a foundation for advancing LLM capabilities in coordinated multi-agent systems.
Abstract
Large Language Models (LLMs) are becoming increasingly powerful and capable of handling complex tasks, e.g., building single agents and multi-agent systems. Compared to single agents, multi-agent systems place higher demands on the collaboration capabilities of language models. Many benchmarks have been proposed to evaluate these collaborative abilities. However, these benchmarks lack fine-grained evaluations of LLM collaborative capabilities. Additionally, existing work overlooks scenarios that combine multi-agent collaboration and competition. To address these two problems, we propose a benchmark, called BattleAgentBench, which defines seven sub-stages across three difficulty levels and conducts a fine-grained evaluation of language models in terms of single-agent scenario navigation capabilities, paired-agent task execution abilities, and multi-agent collaboration and competition capabilities. We conduct extensive evaluations of four leading closed-source models and seven open-source models. Experimental results indicate that API-based models perform very well on simple tasks, whereas small open-source models struggle even with these. On difficult tasks that require collaborative and competitive abilities, API-based models demonstrate some collaborative capabilities, but there is still enormous room for improvement.
