GameDevBench: Evaluating Agentic Capabilities Through Game Development

Wayne Chi, Yixiong Fang, Arnav Yayavaram, Siddharth Yayavaram, Seth Karten, Qiuhong Anna Wei, Runkun Chen, Alexander Wang, Valerie Chen, Ameet Talwalkar, Chris Donahue

TL;DR

GameDevBench introduces the first benchmark for evaluating agentic capabilities in game development within the Godot engine, emphasizing multimodal understanding and the integration of code, assets, and GUI editing. It builds 132 tasks from online tutorials through a four-stage pipeline (data preparation, automatic task construction, task refinement, and human annotation), with deterministic tests derived from Godot to enable verifiable evaluation. Agents struggle on most tasks: the best agent solves only 54.5% of them, and success rates drop on tasks with higher multimodal demands (46.9% on gameplay-oriented tasks vs. 31.6% on 2D graphics tasks). Two simple multimodal feedback mechanisms (editor screenshots via the Model Context Protocol and runtime video) consistently boost performance, underscoring the importance of multimodal signals. The benchmark is publicly released to catalyze future research in agentic game development.

Abstract

Despite rapid progress on coding agents, progress on their multimodal counterparts has lagged behind. A key challenge is the scarcity of evaluation testbeds that combine the complexity of software development with the need for deep multimodal understanding. Game development provides such a testbed, as agents must navigate large, dense codebases while manipulating intrinsically multimodal assets such as shaders, sprites, and animations within a visual game scene. We present GameDevBench, the first benchmark for evaluating agents on game development tasks. GameDevBench consists of 132 tasks derived from web and video tutorials. Tasks require significant multimodal understanding and are complex: the average solution requires over three times as many lines of code and file changes as in prior software development benchmarks. Agents still struggle with game development, with the best agent solving only 54.5% of tasks. We find a strong correlation between perceived task difficulty and multimodal complexity, with success rates dropping from 46.9% on gameplay-oriented tasks to 31.6% on 2D graphics tasks. To improve multimodal capability, we introduce two simple image- and video-based feedback mechanisms for agents. Despite their simplicity, these methods consistently improve performance, with the largest change being an increase in Claude Sonnet 4.5's performance from 33.3% to 47.7%. We release GameDevBench publicly to support further research into agentic game development.
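To make the reported percentages concrete, the short sketch below converts them into approximate task counts. It assumes each overall success rate is computed over all 132 tasks, which the abstract does not state explicitly. The helper name `solved_tasks` is illustrative only. The per-category rates (46.9% and 31.6%) are left out because the category sizes are not given here.

```python
TOTAL_TASKS = 132  # benchmark size stated in the abstract


def solved_tasks(success_rate_pct: float, total: int = TOTAL_TASKS) -> int:
    """Convert a success rate in percent to the nearest whole number of tasks.

    Assumption: the rate is measured over all `total` tasks.
    """
    return round(success_rate_pct / 100 * total)


baseline = solved_tasks(33.3)       # Claude Sonnet 4.5 without multimodal feedback
with_feedback = solved_tasks(47.7)  # Claude Sonnet 4.5 with multimodal feedback
best_agent = solved_tasks(54.5)     # best agent overall

gain_points = round(47.7 - 33.3, 1)  # absolute gain in percentage points

print(f"baseline: {baseline} tasks, with feedback: {with_feedback} tasks")
print(f"gain: {with_feedback - baseline} tasks ({gain_points} percentage points)")
print(f"best agent: {best_agent} of {TOTAL_TASKS} tasks")
```

Under that assumption, the 33.3% to 47.7% improvement corresponds to roughly 19 additional tasks solved, and the best agent still leaves about 60 of the 132 tasks unsolved.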

Paper Structure

This paper contains 27 sections, 11 figures, 3 tables.

Figures (11)

  • Figure 1: We present GameDevBench, a benchmark for evaluating an agent's ability to solve complex and multimodal game development tasks in a modern game engine.
  • Figure 2: An example task from GameDevBench that requests the creation of a UI minimap. The top shows the visual GUI representation with points of interest highlighted. The bottom shows the same scenes and files represented in code. Tasks can be solved via the editor or entirely through code, although either method requires understanding multimodal assets. Game development tasks are complex and require editing dense files, identifying and visually understanding various assets, and navigating various nodes (game elements) and scenes (collections of nodes).
  • Figure 3: Types of editors in Godot. Top-left is the scene editor. Top-right is the script editor. The bottom contains various contextual editors. From left to right: tilemap, shader, animation, and audio editors. Contextual editors surface depending on the use case. Typically, tasks that use contextual editors require deeper multimodal understanding.
  • Figure 4: GameDevBench features a diverse set of file types (27 different types, left). The vast majority of tasks contain images, resources (e.g., shaders), or multiple asset types (middle). Each task contains multiple scripts and scenes, both of which are context-rich and require a significant number of tokens to process (right).
  • Figure 5: In general, agents perform better on tasks that require skills focused on gameplay functionality than on tasks that require multimodal understanding, such as 2D and 3D graphics tasks. Performance across editor categories depends on overall model strength. Stronger models (left 4 agents) tend to perform similarly across all editor types, while weaker models (right 3 agents) tend to perform worse on tasks requiring the scene and contextual editors. All success rates are taken from results where the agent has access to multimodal feedback.
  • ...and 6 more figures