VideoGameQA-Bench: Evaluating Vision-Language Models for Video Game Quality Assurance

Mohammad Reza Taesiri, Abhijay Ghildyal, Saman Zadtootaghaj, Nabajeet Barman, Cor-Paul Bezemer

TL;DR

VideoGameQA-Bench presents a comprehensive benchmark to evaluate vision-language models on video game quality assurance tasks, addressing a critical gap where automated QA lags behind game development needs. By combining real-world and Unity-generated data across nine tasks (image and video), the study reveals that current VLMs struggle with fine-grained scene understanding and visual regression, though they show promise in glitch detection and bug-report generation. The work provides detailed protocols, ground-truth schemas, and an LLM-aided evaluation approach, highlighting both the potential and limitations of current models for automating game QA. Overall, the benchmark offers a valuable platform to drive progress in automated visual QA for video games and informs practical deployment considerations.

Abstract

With video games now generating the highest revenues in the entertainment industry, optimizing game development workflows has become essential for the sector's sustained growth. Recent advancements in Vision-Language Models (VLMs) offer considerable potential to automate and enhance various aspects of game development, particularly Quality Assurance (QA), which remains one of the industry's most labor-intensive processes with limited automation options. To accurately evaluate the performance of VLMs in video game QA tasks and determine their effectiveness in handling real-world scenarios, there is a clear need for standardized benchmarks, as existing benchmarks are insufficient to address the specific requirements of this domain. To bridge this gap, we introduce VideoGameQA-Bench, a comprehensive benchmark that covers a wide array of game QA activities, including visual unit testing, visual regression testing, needle-in-a-haystack tasks, glitch detection, and bug report generation for both images and videos of various games. Code and data are available at: https://asgaardlab.github.io/videogameqa-bench/

Paper Structure

This paper contains 59 sections, 2 equations, 77 figures, 16 tables.

Figures (77)

  • Figure 1: Sample tasks from VideoGameQA-Bench. (a) A unit test where the model should verify small details in the image, such as a character's orientation and the background. (b) A visual regression test where the model should detect unacceptable changes between two versions of the same scene. (c) A UI unit test in which the model must visually verify user interface components, such as a chemistry graph between players. (d) A bug report generation task where the model needs to generate a bug report for a glitch. (e) Two glitch detection tasks, where the model must identify visual anomalies, such as unnatural body configuration (left) or object clipping (right, fingers clipping the apple). (f) Two glitch detection tasks, where the model is required to verify the glitch-free status of images with intentional object clipping and high scene complexity. (g) A parametric test that evaluates whether the model can detect clipping at various object proximities. (h) A needle-in-a-haystack task, which requires the model to identify the first frame in which a glitch occurs.
  • Figure 2: Samples from challenging cases that most VLMs consistently struggle with. (a) Failure in spatial reasoning, such as determining object orientation (whether an airplane is facing toward the camera or away). (b) Failure to read UIs with complex layouts and objects arranged in grids. (c) Failure to detect common-sense inconsistencies, such as a missing gun in the hand. (d) Failure to detect unnatural body configurations. (e) Failure to detect missing foreground objects (candles). (f) Failure to detect and analyze object movement such as shaking or bouncing.
  • Figure A1: We use Gemini-2.5-Pro to draft an initial visual unit test based on an existing image.
  • Figure A2: We use Gemini-2.5-Pro to draft an initial UI unit test based on an existing image.
  • Figure A3: The default prompt associated with each image in the dataset for the image-based glitch detection task.
  • ...and 72 more figures