Retrieval-Augmented Code Generation for Situated Action Generation: A Case Study on Minecraft

Chalamalasetti Kranti, Sherzod Hakimov, David Schlangen

TL;DR

This work tackles predicting Builder action sequences in the Minecraft Collaborative Building Task from natural language instructions by reframing the problem as code generation with large language models. It uses few-shot in-context prompting and, separately, fine-tuning (Q-LORA) to generate code snippets (place(...) and pick(...)) that represent Builder actions. GPT-4 delivers the strongest performance (micro-F1 around 0.39), with open-source LLMs like Llama-3-70b close behind, and fine-tuning a smaller model yields a modest ~6% improvement. The study also analyzes error sources (spatial prepositions, shapes, and anaphora) and highlights issues in the dataset's ground truth (Builder mistakes) that limit evaluation, offering insights for improving grounded language understanding in situated action prediction.
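
To make the micro-F1 figure above concrete, here is a minimal sketch of one common way to compute micro-averaged F1 over predicted Builder actions. It is an illustration under assumptions, not the paper's evaluation code: the function name `micro_f1`, the tuple encoding of actions (action name, color, x, y, z), and the exact-match criterion are all hypothetical, since the summary does not specify the arguments of place(...) and pick(...) or how matches are counted.

```python
from collections import Counter


def micro_f1(predicted_turns, reference_turns):
    """Micro-averaged F1 over per-turn action multisets.

    Each turn is a list of hashable actions. The tuple layout used in
    the example below is a hypothetical encoding, not the paper's.
    """
    tp = fp = fn = 0
    for pred, ref in zip(predicted_turns, reference_turns):
        pred_counts, ref_counts = Counter(pred), Counter(ref)
        # Multiset intersection: actions present in both lists.
        overlap = sum((pred_counts & ref_counts).values())
        tp += overlap
        fp += sum(pred_counts.values()) - overlap
        fn += sum(ref_counts.values()) - overlap
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical actions: (name, color, x, y, z).
a = ("place", "red", 0, 1, 0)
b = ("place", "blue", 1, 1, 0)
c = ("pick", "red", 0, 1, 0)
d = ("place", "green", 2, 1, 0)

predicted = [[a, b], [c]]
reference = [[a], [c, d]]
score = micro_f1(predicted, reference)
```

Counts are pooled across all turns before precision and recall are computed, which is what distinguishes micro-averaging from averaging per-turn scores.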

Abstract

In the Minecraft Collaborative Building Task, two players collaborate: an Architect (A) provides instructions to a Builder (B) to assemble a specified structure using 3D blocks. In this work, we investigate the use of large language models (LLMs) to predict the sequence of actions taken by the Builder. Leveraging LLMs' in-context learning abilities, we use few-shot prompting techniques that significantly improve performance over baseline methods. Additionally, we present a detailed analysis of the gaps in performance for future work.

Paper Structure (24 sections, 6 figures, 2 tables)

Figures (6)

  • Figure 1: Illustration of the LLM interpreting block placement instructions. The initial world view is empty. The LLM receives instructions from User A and generates action predictions.
  • Figure 2: Voxel representations for sample turns that correspond to the spatial preposition, geometric shape, and anaphora categories. Two samples are given for each category. Samples on the left-hand side are generated correctly, while samples on the right-hand side contain mistakes, which are highlighted.
  • Figure 3: Prompt template used for the action prediction task. The system information specifies system-level behavior; the environment information gives the details of the user-agent environment; the context information contains the in-context examples; and the task information specifies the response format to follow.
  • Figure 4: Retrieval of relevant in-context examples based on the current test instruction.
  • Figure 5: Excerpt of an utterance that contains Builder mistakes, from game ID B29-A1-C151-1524078449685. Action pairs in which an item is first placed and later picked up are highlighted in the same colour.
  • ...and 1 more figures
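
The captions for Figures 3 and 4 describe a prompt with four parts (system, environment, context, task) whose in-context examples are retrieved for the current test instruction. The sketch below shows one way such a pipeline could be wired together. It rests on assumptions: the summary does not say which similarity measure the authors use, so token-overlap (Jaccard) similarity stands in for it, and the names `retrieve_examples` and `build_prompt`, the example pool, and the prompt wording are all hypothetical. The action strings are left elided as place(...) and pick(...), as in the source.

```python
def tokenize(text):
    """Lowercased whitespace tokens as a set (a deliberately simple stand-in)."""
    return set(text.lower().split())


def jaccard(a, b):
    """Jaccard similarity between two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def retrieve_examples(instruction, pool, k=2):
    """Return the k pool examples most similar to the test instruction."""
    query = tokenize(instruction)
    ranked = sorted(
        pool,
        key=lambda ex: jaccard(query, tokenize(ex["instruction"])),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(system, environment, examples, task, instruction):
    """Join the four prompt parts, then append the test instruction."""
    context = "\n\n".join(
        f"Instruction: {ex['instruction']}\nActions: {ex['actions']}"
        for ex in examples
    )
    query = f"Instruction: {instruction}\nActions:"
    return "\n\n".join([system, environment, context, task, query])


# Hypothetical example pool; action arguments are elided as in the source.
pool = [
    {"instruction": "put a red block on top of the blue block", "actions": "place(...)"},
    {"instruction": "remove the green block", "actions": "pick(...)"},
    {"instruction": "build a row of three yellow blocks", "actions": "place(...)"},
]

test_instruction = "place a red block on top of the blue one"
selected = retrieve_examples(test_instruction, pool, k=1)
prompt = build_prompt(
    system="You are a Builder agent.",                 # hypothetical wording
    environment="The world is a 3D grid of blocks.",   # hypothetical wording
    examples=selected,
    task="Respond only with place(...) and pick(...) calls.",
    instruction=test_instruction,
)
```

A dense embedding model could replace `jaccard` without changing the rest of the sketch; only the ranking key would differ.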