GlitchBench: Can large multimodal models detect video game glitches?
Mohammad Reza Taesiri, Tianjun Feng, Anh Nguyen, Cor-Paul Bezemer
TL;DR
GlitchBench evaluates large multimodal models (LMMs) on real-world, complex glitch-detection tasks derived from video games. The authors compile a two-part dataset (513 real glitch frames from community sources plus 75 Unity-generated glitches) and a 330-frame glitch-free baseline, spanning 205 games, to probe both visual perception and reasoning. Eleven state-of-the-art LMMs, including GPT-4V, are evaluated with three free-form prompts (Q1: what is unusual, Q2: what is wrong, Q3: a detailed description of the image). Responses are judged semantically by Llama-2-70B-Chat and supplemented by human evaluation. GPT-4V achieves the best average score of 43.4% (Q1-Q2), while glitch-free captions reach up to 64.9% in Q3. The results leave meaningful headroom (roughly 30–35%) for future models, show that higher image resolution improves performance, and expose systematic weaknesses such as difficulty with subtle glitches, facial reasoning, and multimodal hallucinations. The study argues that conventional multimodal benchmarks may not predict performance on real-world, reasoning-intensive tasks, which points to the need for stress-tested, domain-rich evaluation data and prompt designs. GlitchBench thus provides a challenging benchmark to drive progress in robust multimodal perception and reasoning for real-world glitch detection in games and beyond.
Abstract
Large multimodal models (LMMs) have evolved from large language models (LLMs) to integrate multiple input modalities, such as visual inputs. This integration augments the capacity of LLMs for tasks requiring visual comprehension and reasoning. However, the extent and limitations of their enhanced abilities are not fully understood, especially when it comes to real-world tasks. To address this gap, we introduce GlitchBench, a novel benchmark derived from video game quality assurance tasks, to test and evaluate the reasoning capabilities of LMMs. Our benchmark is curated from a variety of unusual and glitched scenarios from video games and aims to challenge both the visual and linguistic reasoning powers of LMMs in detecting and interpreting out-of-the-ordinary events. We evaluate multiple state-of-the-art LMMs, and we show that GlitchBench presents a new challenge for these models. Code and data are available at: https://glitchbench.github.io/
