Large Language Models are Pretty Good Zero-Shot Video Game Bug Detectors

Mohammad Reza Taesiri, Finlay Macklon, Yihe Wang, Hengshuo Shen, Cor-Paul Bezemer

TL;DR

The study investigates whether large language models can detect bugs in video game event sequences in a zero-shot setting by reframing bug detection as a question-answering (QA) task. It introduces the GameBugDescriptions dataset (167 buggy gameplay videos, 334 QA pairs across 8 games) and evaluates six models from the OPT and InstructGPT families using a multi-stage prompting strategy. The best model reaches up to 70.66% accuracy in identifying the buggy event and 44.01% in classifying the bug type, showing both the promise and the limits of this approach for automated game testing. The work provides a new, challenging out-of-distribution benchmark and releases code and data to support further work on AI-assisted video game testing.

Abstract

Video game testing requires game-specific knowledge as well as common sense reasoning about the events in the game. While AI-driven agents can satisfy the first requirement, it is not yet possible to meet the second requirement automatically. Therefore, video game testing often still relies on manual testing, and human testers are required to play the game thoroughly to detect bugs. As a result, it is challenging to fully automate game testing. In this study, we explore the possibility of leveraging the zero-shot capabilities of large language models for video game bug detection. By formulating the bug detection problem as a question-answering task, we show that large language models can identify which event is buggy in a sequence of textual descriptions of events from a game. To this end, we introduce the GameBugDescriptions benchmark dataset, which consists of 167 buggy gameplay videos and a total of 334 question-answer pairs across 8 games. We extensively evaluate the performance of six models across the OPT and InstructGPT large language model families on our benchmark dataset. Our results are promising for employing language models to detect video game bugs. With the proper prompting technique, we achieve an accuracy of 70.66%, and on some video games, up to 78.94%. Our code, evaluation data, and the benchmark are available at https://asgaardlab.github.io/LLMxBugs
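To make the question-answering formulation concrete, the following is a minimal Python sketch of how a sequence of textual event descriptions might be turned into a zero-shot prompt and scored for accuracy. It is an illustration under stated assumptions, not the authors' implementation: the prompt wording, the function names (`build_prompt`, `parse_answer`, `accuracy`), and the stub model are all hypothetical, and the paper's actual multi-stage prompts may differ substantially.

```python
import re
from typing import Callable, List, Optional, Tuple


def build_prompt(events: List[str]) -> str:
    """Number the event descriptions and ask which one is buggy.

    The wording here is a hypothetical example, not the paper's prompt.
    """
    numbered = "\n".join(f"{i}. {e}" for i, e in enumerate(events, start=1))
    return (
        "The following events happened in a video game:\n"
        f"{numbered}\n"
        "Question: Which event is a bug? Answer with the event number.\n"
        "Answer:"
    )


def parse_answer(text: str) -> Optional[int]:
    """Return the first integer in the model's reply, or None if absent."""
    match = re.search(r"\d+", text)
    return int(match.group()) if match else None


def accuracy(
    examples: List[Tuple[List[str], int]],
    model: Callable[[str], str],
) -> float:
    """Fraction of examples where the parsed answer equals the label.

    `model` is any callable mapping a prompt string to a reply string,
    for instance a thin wrapper around a language model API (assumed).
    """
    if not examples:
        raise ValueError("examples must not be empty")
    correct = sum(
        parse_answer(model(build_prompt(events))) == label
        for events, label in examples
    )
    return correct / len(examples)


if __name__ == "__main__":
    # Illustrative events loosely modeled on the Figure 1 caption.
    events = [
        "A plane is flying low over a field.",
        "A person is descending with a parachute.",
        "The plane touches the parachute cords and loses its right wing.",
    ]
    print(build_prompt(events))
```

Keeping the model behind a plain callable means the same scoring code can be run against any language model, or against a stub when testing the prompt and parsing logic.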

Paper Structure

This paper contains 35 sections, 8 figures, and 4 tables.

Figures (8)

  • Figure 1: An example of using a large language model to detect a video game bug by classifying a sequence of events in the Grand Theft Auto V video game in which a collision between a plane and parachute cords leads to the plane losing its right wing. The highlighted text shows the response of the davinci model from the InstructGPT family.
  • Figure 2: Examples of the game knowledge of LLMs.
  • Figure 3: Distribution of bug types across games in the GameBugDescriptions dataset.
  • ...and 5 more figures