GameWorld: Towards Standardized and Verifiable Evaluation of Multimodal Game Agents

Mingyu Ouyang, Siyuan Hu, Kevin Qinghong Lin, Hwee Tou Ng, Mike Zheng Shou

Abstract

Towards an embodied generalist for real-world interaction, Multimodal Large Language Model (MLLM) agents still face challenges from latency, sparse feedback, and irreversible mistakes. Video games offer an ideal testbed with rich visual observations and closed-loop interaction, demanding fine-grained perception, long-horizon planning, and precise control. However, systematically evaluating these capabilities is currently hindered by heterogeneous action interfaces and heuristic verification. To this end, we introduce GameWorld, a benchmark designed for standardized and verifiable evaluation of MLLMs as generalist game agents in browser environments. Two game agent interfaces are studied: (i) computer-use agents that directly emit keyboard and mouse controls, and (ii) generalist multimodal agents that act in a semantic action space via deterministic Semantic Action Parsing. GameWorld contains 34 diverse games and 170 tasks, each paired with state-verifiable metrics for outcome-based evaluation. Results across 18 model-interface pairs suggest that even the best-performing agent remains far from human-level capability on video games. Extensive experiments with repeated full-benchmark reruns demonstrate the robustness of the benchmark, while further studies on real-time interaction, context-memory sensitivity, and action validity expose more challenges ahead for game agents. Together, by offering a standardized, verifiable, and reproducible evaluation framework, GameWorld lays a robust foundation for advancing research on multimodal game agents and beyond. The project page is at https://gameworld-bench.github.io.
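The abstract distinguishes two agent interfaces (direct keyboard/mouse control versus a semantic action space mapped deterministically to controls) and outcome-based, state-verifiable metrics. As a concrete illustration only, the sketch below shows one way a semantic action string could be deterministically parsed into low-level key and mouse events, plus a state-based progress metric. Every name here (`parse_semantic_action`, `KeyPress`, the action vocabulary, `task_progress`) is a hypothetical assumption; the paper's actual Semantic Action Parsing and metrics are not specified in this abstract.

```python
# Illustrative sketch, NOT GameWorld's actual API: all class and function names,
# the action vocabulary, and the progress formula are assumptions for exposition.
from dataclasses import dataclass


@dataclass(frozen=True)
class KeyPress:
    key: str           # e.g. "ArrowLeft", "Space"
    duration_ms: int   # how long the key is held


@dataclass(frozen=True)
class MouseClick:
    x: float           # normalized [0, 1] screen coordinates
    y: float


# Deterministic table: each semantic action maps to exactly one control sequence,
# so the same agent output always yields the same low-level behavior.
_ACTION_TABLE = {
    "move_left":  [KeyPress("ArrowLeft", 200)],
    "move_right": [KeyPress("ArrowRight", 200)],
    "jump":       [KeyPress("Space", 50)],
}


def parse_semantic_action(text: str):
    """Parse an agent's semantic action string into low-level control events.

    Raises ValueError for anything outside the allowed action space, so an
    evaluator can count invalid actions explicitly instead of guessing intent.
    """
    name = text.strip().lower()
    if name.startswith("click(") and name.endswith(")"):
        x_str, y_str = name[len("click("):-1].split(",")
        return [MouseClick(float(x_str), float(y_str))]
    if name in _ACTION_TABLE:
        return list(_ACTION_TABLE[name])
    raise ValueError(f"invalid semantic action: {text!r}")


def task_progress(game_state: dict, target_score: int) -> float:
    """Outcome-based metric: read progress from verifiable game state, not heuristics."""
    return min(game_state.get("score", 0) / target_score, 1.0)


if __name__ == "__main__":
    print(parse_semantic_action("jump"))                    # [KeyPress(key='Space', duration_ms=50)]
    print(parse_semantic_action("click(0.5, 0.3)"))         # [MouseClick(x=0.5, y=0.3)]
    print(task_progress({"score": 30}, target_score=100))   # 0.3
```

The point of the sketch is the contrast the abstract draws: a computer-use agent would emit the `KeyPress`/`MouseClick` level directly, whereas a generalist multimodal agent emits the semantic strings and relies on a deterministic parser, and evaluation reads progress from game state rather than from heuristic judgment.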

Figures (8)

  • Figure 1: GameWorld covers 34 diverse games with 170 tasks for standardized evaluation of game agents.
  • Figure 2: Overview of the GameWorld benchmark with four modules: (i) MLLMs as game agents, (ii) Browser-based sandbox environment, (iii) Games & tasks library, and (iv) Outcome-based state-verifiable evaluation. This closes a continuous and interactive observation-action-verification loop for systematically evaluating game agents.
  • Figure 3: Per-game progress heatmap across the GameWorld benchmark. Rows correspond to the 18 evaluated game agents, labeled by model: (a) Claude-Sonnet-4.6, (b) Gemini-2.5-Computer-Use, (c) OpenAI-Computer-Use, (d) Qwen3-VL-Plus, (e) Seed-1.8, (f) Qwen3-VL-235B-A22B, (g) Qwen3-VL-30B-A3B, (h) UI-TARS-1.5-7B, (i) Claude-Sonnet-4.6, (j) Gemini-3-Flash-Preview, (k) GLM-4.6V, (l) GPT-5.2, (m) Grok-4.1-Fast-Reasoning, (n) Kimi-K2.5, (o) Qwen3-VL-Plus, (p) Seed-1.8, (q) Qwen3-VL-235B-A22B, and (r) Qwen3-VL-30B-A3B. (a)-(h) are Computer-Use Agents and (i)-(r) are Generalist Multimodal Agents. Colors represent average task progress for each game, from high (green) to medium (yellow) to low (red).
  • Figure 4: Per-game average progress across the 34 benchmark games for the four Qwen model-interface pairs used in the repeated-evaluation study. Each panel corresponds to one model-interface pair, and each horizontal bar shows the mean progress over the same ten full-benchmark reruns. Error bars denote one run-level standard deviation.
  • Figure 5: Capability-aligned five-level curriculum profiles across agent interfaces and models. Left: Generalist agents. Right: Computer-Use agents. Each radar axis corresponds to one curriculum level, and values are the average task progress over all games at that level.
  • ...and 3 more figures