
A LLM Benchmark based on the Minecraft Builder Dialog Agent Task

Chris Madge, Massimo Poesio

TL;DR

The paper presents a synthetic, Minecraft-inspired benchmark to evaluate LLMs on spatial reasoning and vector-based math, addressing gaps in traditional text-only benchmarks. It introduces three task modes (Absolute Addressing, Relative Addressing, and Primitive Shapes) to probe distinct spatial competencies, and compares the Zero-shot, Few-shot, and Chain-of-Thought prompting strategies using a large language model. Key findings show that Chain-of-Thought prompts help LLMs handle 3D coordinate reasoning and reduce axis-related errors, while the different addressing modes reveal specific weaknesses. The benchmark provides diagnostic insights for builder-agent design and supports targeted improvements in spatial reasoning and vector math within voxel- and grid-based environments. Overall, this work lays the groundwork for robust evaluation of LLM-driven builders in spatially structured tasks.

Abstract

In this work we propose adapting the Minecraft builder task into an LLM benchmark suitable for evaluating LLM ability in spatially orientated tasks and for informing builder agent design. Previous works have proposed corpora with structures of varying complexity and human-written instructions. We instead attempt to provide a comprehensive synthetic benchmark for testing builder agents over a series of distinct tasks that comprise common building operations. We believe this approach allows us to probe specific strengths and weaknesses of different agents, and to test the ability of LLMs in the challenging area of spatial reasoning and vector-based math.

Paper Structure (14 sections, 1 figure, 1 table)

This paper contains 14 sections, 1 figure, 1 table.

Figures (1)

  • Figure 1: Relative positioning task, placing a green block on top of an existing blue block
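The relative positioning task in Figure 1 reduces to a small piece of vector arithmetic: the target coordinate is the reference block's coordinate plus a direction offset. The following is a minimal sketch of that idea, not the benchmark's own code. The world representation (a dictionary from coordinates to colours), the `OFFSETS` table, and the `place_relative` helper are all hypothetical, and the axis convention (y vertical, as in Minecraft) is an assumption.

```python
from typing import Dict, Tuple

Vec3 = Tuple[int, int, int]

# Assumed axis convention (as in Minecraft): +x east, +y up, +z south.
# This table is illustrative and not taken from the paper.
OFFSETS: Dict[str, Vec3] = {
    "top": (0, 1, 0),
    "bottom": (0, -1, 0),
    "east": (1, 0, 0),
    "west": (-1, 0, 0),
    "south": (0, 0, 1),
    "north": (0, 0, -1),
}


def place_relative(world: Dict[Vec3, str], reference: Vec3,
                   direction: str, colour: str) -> Vec3:
    """Place a block of `colour` adjacent to an existing `reference` block."""
    if reference not in world:
        raise KeyError(f"no block at reference position {reference}")
    offset = OFFSETS[direction]
    # Component-wise vector addition: target = reference + offset.
    target = (reference[0] + offset[0],
              reference[1] + offset[1],
              reference[2] + offset[2])
    if target in world:
        raise ValueError(f"position {target} is already occupied")
    world[target] = colour
    return target


# Hypothetical scene matching the Figure 1 caption: one blue block exists,
# and a green block is placed on top of it.
world: Dict[Vec3, str] = {(2, 0, 3): "blue"}
print(place_relative(world, (2, 0, 3), "top", "green"))  # (2, 1, 3)
```

An absolute addressing task would skip the offset step and write to a given coordinate directly, which is one way to see why the two modes can expose different weaknesses.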