GBQA: A Game Benchmark for Evaluating LLMs as Quality Assurance Engineers

Shufan Jiang, Chios Chen, Zhiyang Chen

Abstract

The autonomous discovery of bugs remains a significant challenge in modern software development. Compared with code generation, bug discovery is considerably harder for large language models (LLMs) because it requires reasoning about complex, dynamic runtime environments. In this paper, we take game development as a representative domain and introduce the Game Benchmark for Quality Assurance (GBQA), a benchmark of 30 games and 124 human-verified bugs spanning three difficulty levels, to evaluate whether LLMs can autonomously detect software bugs. The benchmark is constructed by a multi-agent system that develops games and injects bugs at scale, with human experts in the loop to ensure correctness. Moreover, we provide a baseline interactive agent equipped with a multi-round ReAct loop and a memory mechanism, enabling long-horizon exploration of game environments for bug detection across different LLMs. Extensive experiments on frontier LLMs demonstrate that autonomous bug discovery remains highly challenging: the best-performing model, Claude-4.6-Opus in thinking mode, identifies only 48.39% of the verified bugs. We believe GBQA provides a suitable testbed and evaluation criterion, and that further progress on it will help close the gap in autonomous software engineering.
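As a rough illustration of the baseline agent described above, the sketch below shows what a multi-round ReAct loop with a simple memory mechanism might look like. The environment interface (`env.reset`, `env.act`), the `llm_complete` client, and the `parse_react_response` helper are hypothetical placeholders introduced here for illustration; they are not the paper's actual interfaces.

    # Minimal sketch of a ReAct-style QA agent with a bounded memory.
    # All interfaces (env, llm_complete, parse_react_response) are assumptions.
    from dataclasses import dataclass, field

    @dataclass
    class AgentMemory:
        history: list = field(default_factory=list)      # recent (thought, action, observation) steps
        bug_reports: list = field(default_factory=list)  # suspected bugs logged so far
        max_entries: int = 50                            # bound the context fed back to the LLM

        def add(self, thought, action, observation):
            self.history.append({"thought": thought, "action": action, "obs": observation})
            self.history = self.history[-self.max_entries:]

        def summary(self):
            return "\n".join(f"step {i}: {h['action']} -> {h['obs'][:80]}"
                             for i, h in enumerate(self.history))

    def run_qa_agent(env, llm_complete, parse_react_response, step_budget=500):
        """Multi-round ReAct loop: think, act, observe; suspected bugs become reports."""
        memory = AgentMemory()
        observation = env.reset()
        for step in range(step_budget):
            prompt = (
                "You are a QA engineer exploring a game to find bugs.\n"
                f"Recent steps:\n{memory.summary()}\n"
                f"Current observation:\n{observation}\n"
                "Reply with 'Thought: ...' then either 'Action: <game input>' "
                "or 'Report: <bug description>'."
            )
            thought, kind, content = parse_react_response(llm_complete(prompt))
            if kind == "report":
                memory.bug_reports.append({"step": step, "description": content})
                continue
            observation = env.act(content)   # apply the chosen game input
            memory.add(thought, content, observation)
        return memory.bug_reports

The memory summary keeps only recent steps so the prompt stays within context limits over long-horizon exploration; the real agent's memory mechanism may differ.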

Figures (8)

  • Figure 1: Evolution of the software development paradigm in the LLM era. (a) Traditional human-driven iterative workflow. (b) Human--LLM collaborative coding, where a coding agent assists development under human supervision. (c) Toward a fully autonomous coding system that can generate code, detect bugs, and fix them without a human in the loop. While existing benchmarks primarily focus on code generation and fixing, our benchmark emphasizes the autonomous bug discovery and quality assurance stages of the development cycle.
  • Figure 2: Overview of GBQA. The dataset is constructed using a multi-agent game builder that generates 30 game environments with 124 implanted bugs, which are annotated and categorized into three difficulty levels (Easy, Medium, Hard) by human QA experts. During evaluation, a QA agent autonomously interacts with the game environment through ReAct loops and produces structured bug reports. A critic agent then verifies the reported bugs by matching them against human-annotated ground truth to compute quantitative metrics (an illustrative sketch of this verification step follows the figure list).
  • Figure 2: Inter-Annotator Agreement analysis for human annotation in bug classification.
  • Figure 3: Percentage of bugs discovered by difficulty level across step budgets. Easy bugs are largely discovered within the first 300 steps, while hard bugs require substantially more interaction steps and continue to be found even at nearly 500 steps.
  • Figure 4: Ablation study of the memory module. Each cluster corresponds to a session, and vertical arrows indicate performance gains as the step budget increases. The four trend lines show the aggregated trend for the same memory setting across sessions.
  • ...and 3 more figures
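
The verification step named in the Figure 2 caption (a critic agent matching reported bugs against human-annotated ground truth) could be sketched roughly as follows. The `llm_judge` callable and the data layout are assumptions made for illustration, not GBQA's actual implementation.

    # Illustrative sketch of critic-style verification: each agent report is
    # matched against annotated ground-truth bugs and a recall-style score is
    # returned. Data layout and llm_judge are assumptions, not the paper's API.
    def verify_reports(reports, ground_truth_bugs, llm_judge):
        """Match reported bugs against ground truth and return the fraction found.

        `reports` is a list of free-text bug descriptions from the QA agent;
        `ground_truth_bugs` is a list of dicts with 'id' and 'description';
        `llm_judge` is any callable that answers a yes/no prompt with text.
        """
        matched = set()
        for report in reports:
            for bug in ground_truth_bugs:
                if bug["id"] in matched:
                    continue
                prompt = (
                    "Does this bug report describe the annotated bug?\n"
                    f"Report: {report}\n"
                    f"Annotated bug: {bug['description']}\n"
                    "Answer YES or NO."
                )
                if llm_judge(prompt).strip().upper().startswith("YES"):
                    matched.add(bug["id"])
                    break  # one report counts toward at most one annotated bug here
        return len(matched) / len(ground_truth_bugs) if ground_truth_bugs else 0.0

Under this assumed scoring, the reported 48.39% for the best model would correspond to the fraction of annotated bugs matched by at least one agent report.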