Collab-Overcooked: Benchmarking and Evaluating Large Language Models as Collaborative Agents
Haochen Sun, Shuwen Zhang, Lujie Niu, Lei Ren, Hao Xu, Hao Fu, Fangkun Zhao, Caixia Yuan, Xiaojie Wang
TL;DR
Collab-Overcooked introduces a formal, open-source benchmark for evaluating Large Language Model-based Multi-Agent Systems (LLM-MAS) in a collaborative, process-driven setting. It combines resource isolation, asymmetric task knowledge, and natural-language communication to enforce genuine collaboration across 30 tasks at 6 difficulty levels, and defines Trajectory Efficiency Scores ($TES$) and Incremental TES ($ITES$) alongside end-to-end metrics like Progress Completeness ($PC$) and Initiating/Responding Capabilities ($IC$, $RC$). The study reveals that while modern LLMs can interpret goals, active collaboration and long-horizon adaptation lag behind, with attention misalignment identified as a key bottleneck; attention-guided interventions can markedly improve performance. Human performance sets a ceiling, highlighting substantial gaps in current LLM-MAS capabilities and underscoring the need for collaborative memory and targeted fine-tuning to advance practical collaborative AI systems. The authors publicly release the benchmark and evaluation suite to accelerate standardized, open research in collaborative AI.
Abstract
Large Language Model (LLM)-based agent systems have made great strides in real-world applications beyond traditional NLP tasks. This paper proposes a new LLM-based Multi-Agent System (LLM-MAS) benchmark, Collab-Overcooked, built on the popular Overcooked-AI game with more applicable and challenging tasks in interactive environments. Collab-Overcooked extends existing benchmarks in two novel ways. First, it provides a multi-agent framework supporting diverse tasks and objectives and encourages collaboration through natural language communication. Second, it introduces a spectrum of process-oriented evaluation metrics to assess the fine-grained collaboration capabilities of different LLM agents, a dimension often overlooked in prior work. We conduct extensive experiments with 13 popular LLMs and show that, while the LLMs exhibit a strong ability in goal interpretation, there are significant shortcomings in active collaboration and continuous adaptation, which are critical for efficiently fulfilling complex tasks. Notably, we highlight the strengths and weaknesses of LLM-MAS and provide insights for improving and evaluating LLM-MAS on a unified and open-source benchmark. The environments, 30 open-ended tasks, and the evaluation package are publicly available at https://github.com/YusaeMeow/Collab-Overcooked.
