Agents in the Sandbox: End-to-End Crash Bug Reproduction for Minecraft

Eray Yapağcı; Yavuz Alp Sencer Öztürk; Eray Tüzün

TL;DR

This paper tackles automated crash bug reproduction in Minecraft with BugCraft, an end-to-end framework that turns unstructured bug reports into reproduced in-game crashes through a two-stage pipeline: a Step Synthesizer that generates a plan and augments it with game knowledge, and an Action Model that executes the steps through a custom macro API and vision-based reasoning. It also introduces BugCraft-Bench, a dataset of 86 confirmed crash reports, on which end-to-end reproduction reaches 30.2% with GPT-4o and 34.9% with GPT-4.1, outperforming baselines and suggesting gains in efficiency and scalability for game testing. The work shows that LLM-driven agents can automate complex, interactive bug reproduction, and it identifies bottlenecks in action execution and plan fidelity that point to future work on game-aware reasoning, more robust APIs, and applicability beyond Minecraft. By releasing code, logs, and BugCraft-Bench, the authors lay groundwork for further research in automated game bug reproduction and high-throughput bug triage.
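As a quick sanity check on the reported rates, the sketch below back-computes how many of the 86 benchmark reports each percentage would correspond to. The counts are inferred here from the percentages and the dataset size; they are not stated in the text above, and the function name `implied_count` is purely illustrative.

```python
# Back-compute approximate reproduction counts from the reported rates.
# Assumption: both rates are measured over all 86 BugCraft-Bench reports.
BENCH_SIZE = 86
RATES = {"GPT-4o": 30.2, "GPT-4.1": 34.9}  # percentages quoted in the TL;DR


def implied_count(rate_percent: float, total: int) -> int:
    """Nearest whole number of reports matching a percentage of `total`."""
    return round(rate_percent / 100 * total)


counts = {model: implied_count(rate, BENCH_SIZE) for model, rate in RATES.items()}

for model, n in counts.items():
    # Recompute the rate from the inferred count to confirm it rounds back.
    print(f"{model}: about {n}/{BENCH_SIZE} = {100 * n / BENCH_SIZE:.1f}%")
```

Both inferred counts round back to the quoted percentages, so the figures are at least internally consistent with a denominator of 86.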

Abstract

Reproducing game bugs, particularly crash bugs in continuously evolving games like Minecraft, is a notoriously manual and time-consuming process that is challenging to automate. An interview we conducted with a key decision maker from Minecraft confirms this: a substantial portion of crash reports require manual scenario reconstruction. Despite the success of LLM-driven bug reproduction in other software domains, games, with their complex interactive environments, remain largely unaddressed. This paper introduces BugCraft, a novel end-to-end framework that automates the reproduction of crash bugs in Minecraft directly from user-submitted bug reports. BugCraft employs a two-stage approach. First, a Step Synthesizer uses LLMs and Minecraft Wiki knowledge to transform bug reports into high-quality, structured steps to reproduce (S2R). Second, an Action Model, powered by a vision-based LLM agent and a custom macro API, executes these S2R steps within Minecraft to trigger the reported crash. To facilitate evaluation, we introduce BugCraft-Bench, a curated dataset of Minecraft crash bug reports. On BugCraft-Bench, our framework reproduced 34.9% of crash bugs end-to-end with GPT-4.1, outperforming baseline computer-use models by 37%. BugCraft demonstrates the feasibility of automated reproduction of crash bugs in complex game environments using LLMs, opening promising avenues for game testing and development. Our code is openly available at https://bugcraft2025.github.io
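The two-stage flow described above can be pictured with a minimal sketch. This is not BugCraft's actual code: every name here (`synthesize_steps`, `run_steps`, `reproduce`, `ReproductionResult`) is hypothetical, the LLM, the wiki lookup, and the in-game executor are replaced by plain callables, and the sample report text is invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class ReproductionResult:
    steps: List[str]   # structured steps to reproduce (S2R)
    executed: int      # number of steps run before stopping
    crashed: bool      # whether a crash was observed


def synthesize_steps(report: str, lookup: Callable[[str], str]) -> List[str]:
    """Stage 1 stand-in: one step per non-empty report line.

    `lookup` plays the role of knowledge augmentation (hypothetical);
    a non-empty note is appended to the step in brackets.
    """
    steps = []
    for line in report.splitlines():
        line = line.strip()
        if not line:
            continue
        note = lookup(line)
        steps.append(f"{line} [{note}]" if note else line)
    return steps


def run_steps(steps: List[str], execute: Callable[[str], bool]) -> Tuple[int, bool]:
    """Stage 2 stand-in: run steps in order and stop at the first crash.

    `execute` returns True when the step crashed the game (hypothetical).
    """
    for index, step in enumerate(steps, start=1):
        if execute(step):
            return index, True
    return len(steps), False


def reproduce(report: str,
              lookup: Callable[[str], str],
              execute: Callable[[str], bool]) -> ReproductionResult:
    """Chain both stages: report -> S2R steps -> execution outcome."""
    steps = synthesize_steps(report, lookup)
    executed, crashed = run_steps(steps, execute)
    return ReproductionResult(steps, executed, crashed)


# Toy usage with stubbed components and an invented report.
report = "Create a new world\nPlace a piston\nPower the piston"
result = reproduce(report, lookup=lambda s: "", execute=lambda s: "Power" in s)
print(result.crashed, result.executed)
```

Keeping the two stages behind separate callables mirrors the separation the abstract describes: plan quality and execution reliability can then be swapped out or evaluated independently.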
