TeamCraft: A Benchmark for Multi-Modal Multi-Agent Systems in Minecraft

Qian Long, Zhi Li, Ran Gong, Ying Nian Wu, Demetri Terzopoulos, Xiaofeng Gao

TL;DR

TeamCraft introduces a large-scale, Minecraft-based benchmark for evaluating multi-modal multi-agent collaboration, using multi-modal prompts (language plus orthographic-view images) and procedurally generated task variants to probe generalization across novel goals, scenes, and agent counts. It pairs this with expert planner-based demonstrations and both centralized and decentralized control settings to study coordination under realistic sensory inputs. Experiments with vision-language-action models and GPT-4o reveal significant generalization gaps in multi-modal planning and substantial differences between multi-modal and grid-world baselines, highlighting the need for improved inter-agent communication and planning under visual constraints. The authors provide open-source code, data, and baselines to foster continued progress in embodied AI and multi-agent collaboration research.

Abstract

Collaboration is a cornerstone of society. In the real world, human teammates make use of multi-sensory data to tackle challenging tasks in ever-changing environments. It is essential for embodied agents collaborating in visually-rich environments replete with dynamic interactions to understand multi-modal observations and task specifications. To evaluate the performance of generalizable multi-modal collaborative agents, we present TeamCraft, a multi-modal multi-agent benchmark built on top of the open-world video game Minecraft. The benchmark features 55,000 task variants specified by multi-modal prompts, procedurally-generated expert demonstrations for imitation learning, and carefully designed protocols to evaluate model generalization capabilities. We also perform extensive analyses to better understand the limitations and strengths of existing approaches. Our results indicate that existing models continue to face significant challenges in generalizing to novel goals, scenes, and unseen numbers of agents. These findings underscore the need for further research in this area. The TeamCraft platform and dataset are publicly available at https://github.com/teamcraft-bench/teamcraft.

Paper Structure

This paper contains 89 sections, 5 equations, 42 figures, 14 tables.

Figures (42)

  • Figure 1: The TeamCraft platform consists of three main components: (1) a Minecraft server that hosts the game as an online platform, (2) Mineflayer, which serves as the interface for controlling agents in the server, and (3) a Gym-like environment that provides RGB and inventory observations to the models, allowing control of multiple agents through high-level actions.
  • Figure 2: Example task configurations, each a combination of a biome, playground base block, task goal, target block material, and agent count. Agents are initialized with unique inventories, which give them different capabilities for completing various activities. A detailed distribution is provided in the statistics table in the appendix.
  • Figure 3: Multi-modal prompts are provided for all tasks. The system prompt includes both the three orthographic views and specific language instructions. Observations consist of first-person views from different agents, along with agent-specific information.
  • Figure 4: The architecture of the TeamCraft-VLA model. Multi-modal task specifications, combining three orthographic-view images of the task goal state with the corresponding language instructions, are encoded as the initial input to the model. Agents' inventories and visual observations are then encoded at each step to generate actions for the agents. In the decentralized setting, the model has access to only one agent's information, exemplified here by Bot2: items marked with a * indicate that only the data associated with agent 2 are available.
  • Figure 5: Subgoal success rate and task success rate across centralized, decentralized, and grid-world settings. The leftmost column displays the Test category, whose data distribution is similar to that of the training set. The Goal, Scene, and Agents categories represent generalization tasks involving unseen goals, unseen scenes, and four agents, respectively. Average performance is presented in the rightmost column.
  • ...and 37 more figures
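
The difference between the centralized and decentralized settings mentioned in the captions for Figures 1 and 4 can be illustrated with a minimal Python sketch. All names here (`AgentObservation`, `centralized_view`, `decentralized_view`, the bot ids, and the inventory items) are hypothetical and are not taken from the TeamCraft codebase; the sketch only shows which per-agent observations a model would receive in each setting, assuming observations are keyed by agent id.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class AgentObservation:
    """Hypothetical per-agent observation: an RGB frame plus an inventory."""
    rgb: List[List[int]]        # placeholder for a first-person RGB image
    inventory: Dict[str, int]   # item name -> count


def centralized_view(
    obs: Dict[str, AgentObservation],
) -> Dict[str, AgentObservation]:
    """Centralized setting: the model sees every agent's observation."""
    return dict(obs)


def decentralized_view(
    obs: Dict[str, AgentObservation], agent_id: str
) -> Dict[str, AgentObservation]:
    """Decentralized setting: the model sees only one agent's observation."""
    return {agent_id: obs[agent_id]}


# Illustrative data only; the items and ids are made up for this sketch.
obs = {
    "bot1": AgentObservation(rgb=[[0, 0, 0]], inventory={"oak_planks": 4}),
    "bot2": AgentObservation(rgb=[[255, 255, 255]], inventory={"stone_pickaxe": 1}),
}

central = centralized_view(obs)
local = decentralized_view(obs, "bot2")
```

In this sketch, `central` contains entries for both bots, while `local` contains only the entry for `bot2`, mirroring the starred items in the Figure 4 caption.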