Collaborating Action by Action: A Multi-agent LLM Framework for Embodied Reasoning

Isadora White, Kolby Nottingham, Ayush Maniar, Max Robinson, Hansen Lillemark, Mehul Maheshwari, Lianhui Qin, Prithviraj Ammanabrolu

TL;DR

This work tackles the challenge of collaborative embodied reasoning with large language models by introducing MINDcraft, a Minecraft-based platform, and MineCollab, a benchmark of cooking, crafting, and construction tasks that require multi-agent coordination. It shows that current LLMs struggle with efficient, long-horizon collaboration and that heavy reliance on natural language for planning can degrade performance, motivating methods beyond prompting and imitation learning. The authors provide a modular toolkit (47 high-level actions, conversation management, and RAG prompts) plus a sizable dataset for supervised fine-tuning (SFT) and evaluation across 2–5 agent scenarios, highlighting both the potential and the limitations of current approaches. Overall, MINDcraft and MineCollab offer a scalable, reproducible framework for studying embodied, natural-language-grounded collaboration and for driving progress toward more capable multi-agent AI systems in complex environments.

Abstract

Collaboration is ubiquitous and essential in day-to-day life -- from exchanging ideas, to delegating tasks, to generating plans together. This work studies how LLMs can adaptively collaborate to perform complex embodied reasoning tasks. To this end we introduce MINDcraft, an easily extensible platform built to enable LLM agents to control characters in the open-world game of Minecraft; and MineCollab, a benchmark to test the different dimensions of embodied and collaborative reasoning. An experimental study finds that the primary bottleneck in collaborating effectively for current state-of-the-art agents is efficient natural language communication, with agent performance dropping as much as 15% when they are required to communicate detailed task completion plans. We conclude that existing LLM agents are ill-optimized for multi-agent collaboration, especially in embodied scenarios, and highlight the need to employ methods beyond in-context and imitation learning. Our website can be found here: https://mindcraft-minecollab.github.io/

Paper Structure

This paper contains 44 sections, 3 figures, 8 tables.

Figures (3)

  • Figure 1: Task suites and challenges. This figure shows the collaborative and embodied reasoning challenges. In the cooking and crafting tasks, the agents need to delegate tasks, share resources, and use embodied planning to manipulate the world of Minecraft. In the construction tasks, the agents need to navigate and coordinate in the space to ensure they consistently build towards their objective without undoing any progress the other agents have made. Taken together, these tasks comprehensively test collaborative and embodied reasoning.
  • Figure 2: Overview of the MINDcraft workflow. A user or task configuration (left) provides instructions (e.g., “Build a house out of nearby materials”). The Agent (center) takes these instructions, consults an LLM (via a model request) and invokes high-level commands/tools. These commands are then executed in the Minecraft environment (right), with the agent receiving feedback through execution logs. The extensive command library in MINDcraft enables flexible, plug-and-play experimentation with collaborative and embodied LLM agents in a partially observable Minecraft world.
  • Figure 3: Task complexity ablations. In the first row, we ablate the number of agents in the crafting and cooking tasks. Construction tasks can also be run with three or more agents, but doing so was outside our budget for closed-source APIs. In the second row, we ablate access to hidden plan information, such as the recipe for a cake (cooking) or the steps to make a bookshelf (crafting), and find that model performance drops by over 15% when agents are forced to communicate these plans. In the third row, we ablate the complexity of the blueprints by increasing the number of rooms and unique materials, testing different levels of embodied reasoning. We find that the performance of llama3.3-70b-instruct and gpt-4o drops by 10% as blueprint complexity increases.
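
The workflow described in the Figure 2 caption (instruction, model request, high-level command, execution log fed back to the agent) can be illustrated with a small loop. This is only a sketch under stated assumptions, not MINDcraft's actual API: the command names, the `stub_model` function, and the dictionary standing in for the Minecraft world are all hypothetical, and a real agent would send the instruction and logs to an LLM instead of using a fixed rule.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical command type: mutates a toy world state and returns a log line.
Command = Callable[[dict], str]


def collect_wood(world: dict) -> str:
    world["wood"] = world.get("wood", 0) + 4
    return "collected 4 wood"


def place_blocks(world: dict) -> str:
    if world.get("wood", 0) < 4:
        return "failed: not enough wood"
    world["wood"] -= 4
    world["placed"] = world.get("placed", 0) + 4
    return "placed 4 blocks"


# Hypothetical two-entry command library (the paper reports 47 high-level actions).
COMMANDS: Dict[str, Command] = {
    "collectWood": collect_wood,
    "placeBlocks": place_blocks,
}


def stub_model(instruction: str, logs: List[str]) -> str:
    """Stand-in for the model request: choose the next command from the last log."""
    if not logs or logs[-1].startswith(("placed", "failed")):
        return "collectWood"
    return "placeBlocks"


@dataclass
class Agent:
    instruction: str
    world: dict = field(default_factory=dict)
    logs: List[str] = field(default_factory=list)

    def step(self) -> str:
        # 1. Ask the (stub) model which command to run next.
        name = stub_model(self.instruction, self.logs)
        # 2. Execute the command in the toy environment.
        log = COMMANDS[name](self.world)
        # 3. Keep the execution log as feedback for the next model request.
        self.logs.append(log)
        return log


agent = Agent("Build a house out of nearby materials")
for _ in range(4):
    agent.step()
```

The design point the sketch tries to capture is that the agent never touches the environment directly: it only selects named commands, and everything it learns about the outcome arrives through the execution log.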