Retrieval-Augmented Code Generation for Situated Action Generation: A Case Study on Minecraft
Chalamalasetti Kranti, Sherzod Hakimov, David Schlangen
TL;DR
This work predicts Builder action sequences in the Minecraft Collaborative Building Task from natural language instructions by reframing the problem as code generation with large language models. It uses few-shot in-context prompting and, separately, fine-tuning (QLoRA) to generate code snippets (place(...) and pick(...)) that represent Builder actions. GPT-4 performs best (micro-F1 around 0.39), with open-source LLMs such as Llama-3-70b close behind, and fine-tuning a smaller model yields a modest ~6% improvement. The study also analyzes error sources (spatial prepositions, shapes, and anaphora) and highlights ground-truth issues in the dataset (Builder mistakes) that limit evaluation, offering insights for improving grounded language understanding in situated action prediction.
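To make the code-generation framing concrete, the following is a minimal sketch of how generated place(...) and pick(...) snippets could be parsed and scored with micro-F1. The argument layout (a color followed by x, y, z for place, and x, y, z for pick), the helper names parse_actions and micro_f1, and the choice to match actions as multisets per sample are all assumptions made for illustration; the summary above does not specify the exact call signatures or scoring procedure.

```python
import re
from collections import Counter

# Assumed call shapes (not confirmed by the text):
#   place(color, x, y, z) and pick(x, y, z)
ACTION_RE = re.compile(r"\b(place|pick)\s*\(([^()]*)\)")


def parse_actions(text):
    """Extract (name, args) tuples from generated code-like text."""
    actions = []
    for name, raw_args in ACTION_RE.findall(text):
        # Normalize arguments: trim whitespace and surrounding quotes.
        args = tuple(
            a.strip().strip("'\"") for a in raw_args.split(",") if a.strip()
        )
        actions.append((name, args))
    return actions


def micro_f1(predictions, references):
    """Micro-averaged F1, treating each sample as a multiset of actions.

    This matching scheme is an illustrative assumption, not necessarily
    the evaluation used in the paper.
    """
    tp = fp = fn = 0
    for pred, ref in zip(predictions, references):
        pred_counts, ref_counts = Counter(pred), Counter(ref)
        overlap = sum((pred_counts & ref_counts).values())
        tp += overlap
        fp += sum(pred_counts.values()) - overlap
        fn += sum(ref_counts.values()) - overlap
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Counts are pooled across all samples before precision and recall are computed, which is what distinguishes micro-averaging from averaging per-sample scores.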
Abstract
In the Minecraft Collaborative Building Task, two players collaborate: an Architect (A) provides instructions to a Builder (B) to assemble a specified structure using 3D blocks. In this work, we investigate the use of large language models (LLMs) to predict the sequence of actions taken by the Builder. Leveraging LLMs' in-context learning abilities, we use few-shot prompting techniques that significantly improve performance over baseline methods. Additionally, we present a detailed analysis of the gaps in performance for future work.
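As a rough illustration of few-shot prompting in this setting, the sketch below assembles a prompt from a handful of instruction/code demonstrations followed by the new Architect instruction. The prompt layout, the SYSTEM_TEXT wording, the "Architect:"/"Builder:" labels, and the function name build_few_shot_prompt are hypothetical; the abstract does not give the actual template.

```python
# Hypothetical prompt layout; the actual template is not given in this text.
SYSTEM_TEXT = (
    "You are the Builder. Translate each Architect instruction into "
    "place(...) and pick(...) calls."
)


def build_few_shot_prompt(examples, instruction, system_text=SYSTEM_TEXT):
    """Build a few-shot prompt.

    examples: list of (instruction, code) pairs used as in-context
    demonstrations. instruction: the new Architect instruction.
    """
    parts = [system_text, ""]
    for demo_instruction, demo_code in examples:
        parts.append(f"Architect: {demo_instruction}")
        parts.append(f"Builder:\n{demo_code}")
        parts.append("")  # blank line between demonstrations
    # The prompt ends with an open Builder turn for the model to complete.
    parts.append(f"Architect: {instruction}")
    parts.append("Builder:")
    return "\n".join(parts)
```

The returned string would be sent to whichever LLM is being evaluated; the model's completion after the final "Builder:" is then read as the predicted action sequence.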
