TeamCraft: A Benchmark for Multi-Modal Multi-Agent Systems in Minecraft
Qian Long, Zhi Li, Ran Gong, Ying Nian Wu, Demetri Terzopoulos, Xiaofeng Gao
TL;DR
TeamCraft introduces a large-scale, Minecraft-based benchmark for evaluating multi-modal multi-agent collaboration, using multi-modal prompts (language plus orthographic-view images) and procedurally generated task variants to probe generalization across novel goals, scenes, and agent counts. It pairs this with expert planner-based demonstrations and both centralized and decentralized control settings to study coordination under realistic sensory inputs. Experiments with vision-language-action models and GPT-4o reveal significant generalization gaps in multi-modal planning and substantial differences between multi-modal and grid-world baselines, highlighting the need for improved inter-agent communication and planning under visual constraints. The authors provide open-source code, data, and baselines to foster continued progress in embodied AI and multi-agent collaboration research.
Abstract
Collaboration is a cornerstone of society. In the real world, human teammates make use of multi-sensory data to tackle challenging tasks in ever-changing environments. It is essential for embodied agents collaborating in visually-rich environments replete with dynamic interactions to understand multi-modal observations and task specifications. To evaluate the performance of generalizable multi-modal collaborative agents, we present TeamCraft, a multi-modal multi-agent benchmark built on top of the open-world video game Minecraft. The benchmark features 55,000 task variants specified by multi-modal prompts, procedurally-generated expert demonstrations for imitation learning, and carefully designed protocols to evaluate model generalization capabilities. We also perform extensive analyses to better understand the limitations and strengths of existing approaches. Our results indicate that existing models continue to face significant challenges in generalizing to novel goals, scenes, and unseen numbers of agents. These findings underscore the need for further research in this area. The TeamCraft platform and dataset are publicly available at https://github.com/teamcraft-bench/teamcraft.
