Automatic Bug Detection in LLM-Powered Text-Based Games Using LLMs

Claire Jin; Sudha Rao; Xiangyu Peng; Portia Botchway; Jessica Quaye; Chris Brockett; Bill Dolan

TL;DR

The paper addresses logical and game-balance bugs in LLM-powered text-based games with a two-stage, log-driven bug-detection pipeline. Stage 1 aligns each gameplay log to a designer-defined progression graph of scenarios and scenes; Stage 2 aggregates across players to identify bottlenecks and their likely causes. GPT-4 performs all reasoning. Evaluated on 28 player logs from DejaBoom!, the method identifies bottleneck scenes and bug classes, and ablation studies show that it outperforms naive baselines. Because it works from logs alone, it gives quantitative, per-scene evidence of where players get stuck and scales without extra data collection, which suggests uses in automatic game adaptation and in games beyond the tested title. Limitations include dependence on GPT-4 and on English-language data; future work targets more complex or multimodal games and multilingual settings.
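
The cross-player aggregation step described above can be illustrated with a minimal Python sketch. It assumes the alignment stage has already produced, for each player, the set of scenes that player completed. The scene names, the `min_drop` threshold, and the rule of flagging a scene when its completion rate falls sharply relative to the preceding scene are hypothetical choices made for illustration, not the paper's stated criterion.

```python
from collections import Counter


def completion_rates(scene_order, player_completions):
    """Return the fraction of players who completed each scene."""
    if not player_completions:
        raise ValueError("at least one player log is required")
    known = set(scene_order)
    counts = Counter()
    for completed in player_completions:
        # Count each scene at most once per player; ignore unknown labels.
        counts.update(set(completed) & known)
    n_players = len(player_completions)
    return {scene: counts[scene] / n_players for scene in scene_order}


def flag_bottlenecks(scene_order, rates, min_drop=0.3):
    """Flag scenes whose rate drops by at least min_drop from the prior scene.

    The drop-based rule and the default threshold are assumptions of this
    sketch, not values taken from the paper.
    """
    flagged = []
    previous_rate = 1.0
    for scene in scene_order:
        if previous_rate - rates[scene] >= min_drop:
            flagged.append(scene)
        previous_rate = rates[scene]
    return flagged


# Hypothetical toy data: four players, three scenes in designer order.
scene_order = ["scene_a", "scene_b", "scene_c"]
player_completions = [
    {"scene_a", "scene_b", "scene_c"},
    {"scene_a"},
    {"scene_a", "scene_b"},
    {"scene_a"},
]

rates = completion_rates(scene_order, player_completions)
flagged = flag_bottlenecks(scene_order, rates)
```

In this toy data, half of the players never get past `scene_a`, so `scene_b` is the flagged scene. A flagged scene is only a candidate pain point; the logs of players who stalled there would still need to be examined for the cause.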

Abstract

Advancements in large language models (LLMs) are revolutionizing interactive game design, enabling dynamic plotlines and interactions between players and non-player characters (NPCs). However, LLMs may exhibit flaws such as hallucinations, forgetfulness, or misinterpretations of prompts, causing logical inconsistencies and unexpected deviations from intended designs. Automated techniques for detecting such game bugs are still lacking. To address this, we propose a systematic LLM-based method for automatically identifying such bugs from player game logs, eliminating the need for collecting additional data such as post-play surveys. Applied to the text-based game DejaBoom!, our approach effectively identifies bugs inherent in LLM-powered interactive games, surpassing unstructured LLM-powered bug-catching methods and filling the gap in automated detection of logical and design flaws.

Paper Structure (26 sections, 5 figures, 3 tables)

Introduction
DejaBoom! LLM-powered Game
Methods
Stage 1: Structured alignment and summarization of gameplay progression
Stage 2: Identify pain points and causes
Results
Identifying progression bottlenecks
Identifying common causes of bottlenecks
Ablation Studies
Conclusion
Appendix
DejaBoom! Expanded
Sample Game Logs
Raw Game Log
Cleaned Game Log
...and 11 more sections

Figures (5)

Figure 1: Our automated bug detection procedure.
Figure 2: DejaBoom! logic graph. Nodes represent scenes and colored groups indicate scenarios. Arrows indicate order of completion. Merging arrows: only one of the tail nodes is required to proceed to the head node. Dotted arrows: a new location, NPC, or item should be unlocked.
Figure 3: (a) A gameplay step from a section inputted to the LLM. (b) LLM-generated summary for (a).
Figure 4: (a) Completion rate per scene. Black brackets group scenes forming a scenario. Stars mark potential pain point scenes. (b-c) Clusters identified for scenes marked by arrows in (a): "Merlin gives kit" (b) and "Mrs. T reveals Hatter in Park" (c).
Figure 5: DejaBoom! game layout. A map of the village where the game takes place, showing the locations, objects, and NPCs. The player begins the game from home and their goal is to defuse the bomb before it explodes again.
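
The progression-graph semantics in the Figure 2 caption can be made concrete with a small Python sketch. It represents each scene by a list of prerequisite groups, where a group is satisfied when any one of its scenes is complete (the merging-arrow case). The scene names are hypothetical, and treating separate non-merging arrows as all required is an assumption of this sketch rather than something the caption states.

```python
# Hypothetical toy graph: scene -> list of prerequisite groups.
# Every group must be satisfied (assumption); a group is satisfied when
# ANY one of its scenes is complete (merging arrows in the caption).
PREREQS = {
    "scene_a": [],
    "scene_b": [],
    "scene_c": [{"scene_a", "scene_b"}],  # merging arrows: a OR b
    "scene_d": [{"scene_c"}],
}


def is_unlocked(scene, completed, prereqs=PREREQS):
    """Return True if every prerequisite group has a completed scene."""
    return all(group & completed for group in prereqs[scene])


def next_scenes(completed, prereqs=PREREQS):
    """List the not-yet-completed scenes a player could attempt next."""
    return sorted(
        scene
        for scene in prereqs
        if scene not in completed and is_unlocked(scene, completed, prereqs)
    )


start_options = next_scenes(set())
after_b = next_scenes({"scene_b"})
```

With this structure, a log that shows `scene_d` completed while `scene_c` is missing would stand out as inconsistent with the designed order, which is the kind of deviation that aligning logs to the graph is meant to surface.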
