GlitchBench: Can large multimodal models detect video game glitches?

Mohammad Reza Taesiri, Tianjun Feng, Anh Nguyen, Cor-Paul Bezemer

TL;DR

GlitchBench evaluates large multimodal models (LMMs) on real-world glitch-detection tasks derived from video games. The authors compile a two-part dataset (513 real glitch frames from community sources plus 75 Unity-generated glitches) and a 330-frame glitch-free baseline, spanning 205 games, to probe both visual perception and reasoning. Eleven state-of-the-art LMMs, including GPT-4V, answer three free-form prompts: what is unusual about the image (Q1), what is wrong with it (Q2), and a request for a detailed description (Q3). Responses are compared semantically with the ground truth by a Llama-2-70B-Chat judge, supplemented by human evaluation. GPT-4V achieves the best average score of 43.4% over Q1 and Q2, while Q3 detailed descriptions reach up to 64.9%. The results leave meaningful headroom (roughly 30–35%) for future models, show that higher image resolution improves performance, and expose systematic weaknesses such as difficulty with subtle glitches, facial reasoning, and multimodal hallucinations. The study argues that conventional multimodal benchmarks may not predict performance on real-world, reasoning-intensive tasks, and that stress-tested, domain-rich evaluation data and prompt designs are needed. GlitchBench thus provides a challenging benchmark for robust multimodal perception and reasoning in games and beyond.
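The score arithmetic in the summary above can be illustrated with a minimal sketch. It assumes that each judged response reduces to a boolean verdict per question and that the Q1-Q2 figure is an unweighted mean of the two accuracies. The function names and toy verdicts are hypothetical and are not taken from the GlitchBench code.

```python
from statistics import mean


def accuracy(verdicts):
    """Percentage of responses the judge marked as matching the ground truth."""
    if not verdicts:
        raise ValueError("no verdicts to score")
    return 100.0 * sum(verdicts) / len(verdicts)


def q1_q2_average(q1_verdicts, q2_verdicts):
    """Unweighted mean of the Q1 and Q2 accuracies (assumed aggregation)."""
    return mean([accuracy(q1_verdicts), accuracy(q2_verdicts)])


def headroom(score_percent):
    """Gap between a reported score and a perfect 100%."""
    return 100.0 - score_percent


# Toy verdicts for four images (illustrative values only, not real results).
q1 = [True, True, False, True]     # 75% on Q1
q2 = [True, False, False, False]   # 25% on Q2

print(q1_q2_average(q1, q2))       # 50.0
print(round(headroom(64.9), 1))    # 35.1, the upper end of the quoted headroom
```

Applied to the reported Q3 score of 64.9%, the same subtraction gives the upper end of the "roughly 30–35%" headroom quoted in the summary.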

Abstract

Large multimodal models (LMMs) have evolved from large language models (LLMs) to integrate multiple input modalities, such as visual inputs. This integration augments the capacity of LLMs for tasks requiring visual comprehension and reasoning. However, the extent and limitations of their enhanced abilities are not fully understood, especially when it comes to real-world tasks. To address this gap, we introduce GlitchBench, a novel benchmark derived from video game quality assurance tasks, to test and evaluate the reasoning capabilities of LMMs. Our benchmark is curated from a variety of unusual and glitched scenarios from video games and aims to challenge both the visual and linguistic reasoning powers of LMMs in detecting and interpreting out-of-the-ordinary events. We evaluate multiple state-of-the-art LMMs, and we show that GlitchBench presents a new challenge for these models. Code and data are available at: https://glitchbench.github.io/

Paper Structure (47 sections, 47 figures, 5 tables)

Figures (47)

  • Figure 1: The image depicts a screenshot in which it rains inside a room. While the rain should be what is wrong with the image, GPT-4V fails to reason correctly and instead focuses on the color of Batman's costume. Note that the ground truth is never presented as part of the prompt in our study.
  • Figure 2: Sample images from the GlitchBench showing glitches in various games with distinct styles. Samples (a)--(e) are captured from online videos, while sample (f) is generated inside the Unity game engine.
  • Figure 3: To evaluate a model's response, we ask a judge (the Llama-2-70b-Chat model) to compare it semantically with the ground truth.
  • Figure 4: The performance of all tested models on different categories of images in GlitchBench.
  • Figure 5: One of the several cases in which GPT-4V fails to detect a problem with facial features.
  • ...and 42 more figures
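
The judging step described in the caption of Figure 3 can be sketched as follows. This is a hedged illustration, not the authors' implementation: the prompt wording, the function names, and the `ask_llm` callable (any function that sends a string to a judge model and returns its reply) are assumptions. A keyword-based stub stands in for Llama-2-70B-Chat so the example runs without a model.

```python
def build_judge_prompt(ground_truth, model_response):
    """Compose a yes/no comparison prompt for a judge LLM (wording is hypothetical)."""
    return (
        "You are comparing a model's description of a video game image "
        "with a ground-truth glitch description.\n"
        f"Ground truth: {ground_truth}\n"
        f"Model response: {model_response}\n"
        "Does the model response convey the same glitch as the ground truth? "
        "Answer with 'Yes' or 'No'."
    )


def parse_verdict(judge_output):
    """Map the judge's free-text reply to True/False, or None if it is unclear."""
    words = judge_output.strip().lower().split()
    first = words[0].strip(".,:;!\"'") if words else ""
    if first == "yes":
        return True
    if first == "no":
        return False
    return None


def judge(ground_truth, model_response, ask_llm):
    """Run one comparison; ask_llm is an assumed str -> str wrapper around the judge."""
    return parse_verdict(ask_llm(build_judge_prompt(ground_truth, model_response)))


def stub_llm(prompt):
    """Stand-in judge: says 'Yes' only if the model response mentions rain."""
    response_line = next(
        line for line in prompt.splitlines() if line.startswith("Model response:")
    )
    if "rain" in response_line.lower():
        return "Yes."
    return "No, the response misses the glitch."


# Example modeled on Figure 1: the glitch is rain inside a room.
truth = "It is raining inside the room."
print(judge(truth, "Rain is falling indoors.", stub_llm))                # True
print(judge(truth, "Batman's costume has an unusual color.", stub_llm))  # False
```

Returning `None` for replies that start with neither "yes" nor "no" keeps ambiguous judge outputs visible, so they can be reviewed rather than silently counted as failures.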