VideoGameQA-Bench: Evaluating Vision-Language Models for Video Game Quality Assurance
Mohammad Reza Taesiri, Abhijay Ghildyal, Saman Zadtootaghaj, Nabajeet Barman, Cor-Paul Bezemer
TL;DR
VideoGameQA-Bench is a benchmark for evaluating vision-language models (VLMs) on video game quality assurance (QA) tasks, an area where automation lags behind the needs of game development. Combining real-world and Unity-generated data across nine image and video tasks, the study finds that current VLMs struggle with fine-grained scene understanding and visual regression testing, though they show promise in glitch detection and bug-report generation. The work provides detailed protocols, ground-truth schemas, and an LLM-aided evaluation approach, and it highlights both the potential and the limitations of current models for automating game QA. The benchmark thus offers a platform for driving progress in automated visual QA for video games and for informing practical deployment decisions.
Abstract
With video games now generating the highest revenues in the entertainment industry, optimizing game development workflows has become essential for the sector's sustained growth. Recent advancements in Vision-Language Models (VLMs) offer considerable potential to automate and enhance various aspects of game development, particularly Quality Assurance (QA), which remains one of the industry's most labor-intensive processes with limited automation options. To accurately evaluate the performance of VLMs in video game QA tasks and determine their effectiveness in handling real-world scenarios, there is a clear need for standardized benchmarks, as existing benchmarks are insufficient to address the specific requirements of this domain. To bridge this gap, we introduce VideoGameQA-Bench, a comprehensive benchmark that covers a wide array of game QA activities, including visual unit testing, visual regression testing, needle-in-a-haystack tasks, glitch detection, and bug report generation for both images and videos of various games. Code and data are available at: https://asgaardlab.github.io/videogameqa-bench/
