Nebula: A discourse aware Minecraft Builder
Akshay Chaturvedi, Kate Thompson, Nicholas Asher
TL;DR
The paper tackles language-to-action prediction in collaborative tasks by incorporating discourse structure and nonlinguistic context. It proposes Nebula, a Llama-based model fine-tuned on the Minecraft Dialogue Corpus (MDC), which achieves a net-action F1 of about $0.39$ on the MDC test set, roughly double the prior baseline of $0.20$. Using narrative arcs from the Minecraft Structured Dialogue Corpus (MSDC), the authors show that the context within an arc can be sufficient for accurate action prediction inside that arc. They also show that the existing net-action F1 metric is poorly suited to underspecified instructions, where more than one action sequence can be correct. To address these issues, they introduce synthetic datasets (level-1 and level-2), demonstrate that targeted fine-tuning improves shape and location understanding, and propose a more realistic evaluation approach. Overall, the work shows that discourse-aware LLMs can better map complex, underspecified instructions to action sequences in embodied environments, and it offers guidance on metric design for such evaluations.
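To make the metric discussion concrete, here is a minimal sketch of an F1 score over net actions. It is not the authors' evaluation code. It assumes that each net action is a hypothetical `(kind, color, (x, y, z))` tuple, that the sequences have already been reduced to their net effect, and that two empty action sets count as a perfect match.

```python
def net_action_f1(predicted, gold):
    """F1 between two collections of net actions.

    Assumed representation (hypothetical): each action is a hashable
    tuple such as ("place", "red", (0, 1, 0)).
    """
    pred, ref = set(predicted), set(gold)
    if not pred and not ref:
        return 1.0  # assumed convention: nothing to do, nothing done
    true_positives = len(pred & ref)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(ref)
    return 2 * precision * recall / (precision + recall)


gold = [("place", "red", (0, 1, 0)), ("place", "blue", (2, 1, 0))]
predicted = [("place", "red", (0, 1, 0)), ("place", "blue", (3, 1, 0))]
score = net_action_f1(predicted, gold)  # one of two actions matches exactly
```

Under this exact-match scoring, the blue block placed one cell away from the gold position earns no credit, even if the instruction never fixed the position. That is the kind of penalty on underspecified instructions that motivates a more lenient evaluation.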
Abstract
When engaging in collaborative tasks, humans efficiently exploit the semantic structure of a conversation to optimize verbal and nonverbal interactions. But in recent "language to code" or "language to action" models, this information is lacking. We show how incorporating the prior discourse and nonlinguistic context of a conversation situated in a nonlinguistic environment can improve the "language to action" component of such interactions. We fine-tune an LLM to predict actions based on prior context; our model, Nebula, doubles the net-action F1 score of the Jayannavar et al. (2020) baseline on this task. We also investigate our model's ability to construct shapes and understand location descriptions using a synthetic dataset.
