Collaborating Action by Action: A Multi-agent LLM Framework for Embodied Reasoning

Isadora White; Kolby Nottingham; Ayush Maniar; Max Robinson; Hansen Lillemark; Mehul Maheshwari; Lianhui Qin; Prithviraj Ammanabrolu

TL;DR

This work tackles collaborative embodied reasoning with large language models by introducing MINDcraft, a Minecraft-based platform, and MineCollab, a benchmark of cooking, crafting, and construction tasks that require multi-agent coordination. It shows that current LLMs struggle with efficient, long-horizon collaboration and that requiring agents to communicate detailed plans in natural language can degrade performance, which motivates methods beyond prompting and imitation learning. The authors provide a modular toolkit (47 high-level actions, conversation management, and RAG prompts) plus a dataset for supervised fine-tuning (SFT) and evaluation across 2–5 agent scenarios. Together, MINDcraft and MineCollab offer a scalable, reproducible framework for studying embodied, natural-language-grounded collaboration among multiple agents in complex environments.

Abstract

Collaboration is ubiquitous and essential in day-to-day life, from exchanging ideas, to delegating tasks, to generating plans together. This work studies how LLMs can adaptively collaborate to perform complex embodied reasoning tasks. To this end, we introduce MINDcraft, an easily extensible platform built to enable LLM agents to control characters in the open-world game of Minecraft; and MineCollab, a benchmark to test the different dimensions of embodied and collaborative reasoning. An experimental study finds that the primary bottleneck in collaborating effectively for current state-of-the-art agents is efficient natural language communication, with agent performance dropping by as much as 15% when they are required to communicate detailed task completion plans. We conclude that existing LLM agents are ill-optimized for multi-agent collaboration, especially in embodied scenarios, and highlight the need to employ methods beyond in-context and imitation learning. Our website can be found here: https://mindcraft-minecollab.github.io/
