A LLM Benchmark based on the Minecraft Builder Dialog Agent Task

Chris Madge; Massimo Poesio

TL;DR

The paper presents a synthetic, Minecraft-inspired benchmark to evaluate LLMs on spatial reasoning and vector-based math, addressing gaps in traditional text-only benchmarks. It introduces three task modes (Absolute Addressing, Relative Addressing, and Primitive Shapes) to probe distinct spatial competencies, and compares prompting strategies including Zero-shot, Few-shot, and Chain-of-Thought using a large language model. Key findings show that Chain-of-Thought prompts help LLMs better handle 3D coordinate reasoning and reduce axis-related errors, while different addressing modes reveal specific weaknesses. The benchmark provides diagnostic insights for builder-agent design and supports targeted improvements in spatial reasoning and vector math capabilities within voxel/grid-based environments. Overall, this work lays groundwork for robust evaluation of LLM-driven builders in spatially structured tasks.

Abstract

In this work we propose adapting the Minecraft builder task into an LLM benchmark suitable for evaluating LLM ability in spatially oriented tasks, and for informing builder agent design. Previous works have proposed corpora with structures of varying complexity and human-written instructions. We instead attempt to provide a comprehensive synthetic benchmark for testing builder agents over a series of distinct tasks that comprise common building operations. We believe this approach allows us to probe specific strengths and weaknesses of different agents, and to test the ability of LLMs in the challenging area of spatial reasoning and vector-based math.

Paper Structure (14 sections, 1 figure, 1 table)

Introduction
Our Approach
Absolute Addressing
Relative Addressing
Primitive Shapes
Results
Conclusion
Appendix
B1-A3-C8-1522432497234
B1-A3-C4-1522432009099
B1-A3-C1-1522435497386
B3-A2-C12-1522445699382
B3-A2-C23-1522447244858
B1-A3-C3-1522431780184

Figures (1)

Figure 1: Relative positioning task, placing a green block on top of an existing blue block
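
The relative positioning task in Figure 1 amounts to a small piece of vector arithmetic: find the referenced block, add an offset, and place the new block at the result. The following is a minimal sketch of that idea, not the benchmark's actual implementation. It assumes integer `(x, y, z)` grid coordinates with `y` as the vertical axis (as in Minecraft itself); the `OFFSETS` table, the `place_relative` function, and the dictionary-based world representation are all hypothetical names introduced for illustration.

```python
from typing import Dict, Tuple

# Assumption: integer grid coordinates (x, y, z) with y as the vertical axis.
Coord = Tuple[int, int, int]

# Hypothetical mapping from a spatial relation to a unit offset vector.
OFFSETS: Dict[str, Coord] = {
    "on top of": (0, 1, 0),
    "underneath": (0, -1, 0),
}


def place_relative(
    world: Dict[Coord, str],
    anchor_colour: str,
    relation: str,
    new_colour: str,
) -> Coord:
    """Place a block of new_colour relative to the single block of anchor_colour."""
    anchors = [pos for pos, colour in world.items() if colour == anchor_colour]
    if len(anchors) != 1:
        raise ValueError(f"expected exactly one {anchor_colour} block, found {len(anchors)}")

    ax, ay, az = anchors[0]
    dx, dy, dz = OFFSETS[relation]
    target = (ax + dx, ay + dy, az + dz)  # anchor position plus offset vector

    if target in world:
        raise ValueError(f"position {target} is already occupied")
    world[target] = new_colour
    return target


# Illustrative scenario: a green block placed on top of an existing blue block.
world: Dict[Coord, str] = {(2, 0, 3): "blue"}
target = place_relative(world, "blue", "on top of", "green")
print(target)
```

An agent answering the same instruction has to perform the equivalent lookup and addition implicitly, which is where the axis-related errors mentioned in the TL;DR can arise (for example, adding the offset to the wrong coordinate).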
