AI Gamestore: Scalable, Open-Ended Evaluation of Machine General Intelligence with Human Games

Lance Ying, Ryan Truong, Prafull Sharma, Kaiya Ivy Zhao, Nathan Cloos, Kelsey R. Allen, Thomas L. Griffiths, Katherine M. Collins, José Hernández-Orallo, Phillip Isola, Samuel J. Gershman, Joshua B. Tenenbaum

TL;DR

The paper addresses the inadequacy of narrow, static AI benchmarks for evaluating general intelligence and proposes the Multiverse of Human Games as a comprehensive testbed. It introduces the AI GameStore, a scalable pipeline that uses LLMs and humans-in-the-loop to source, generate, refine, annotate, and evaluate human-designed games drawn from real marketplaces, creating a living, open-ended evaluation framework. In a proof-of-concept, 100 AI GameStore games were tested with seven frontier vision-language models and 106 humans, revealing that current models achieve only a small fraction of human performance and struggle with long-horizon planning, memory, and world-model learning. The work lays out concrete steps to expand game diversity, automate generation, and deepen cognitive diagnostics, aiming to drive progress toward human-like general intelligence in AI systems.

Abstract

Rigorously evaluating machine intelligence against the broad spectrum of human general intelligence has become increasingly important and challenging in this era of rapid technological advance. Conventional AI benchmarks typically assess only narrow capabilities in a limited range of human activity. Most are also static, quickly saturating as developers explicitly or implicitly optimize for them. We propose that a more promising way to evaluate human-like general intelligence in AI systems is through a particularly strong form of general game playing: studying how and how well they play and learn to play all conceivable human games, in comparison to human players with the same level of experience, time, or other resources. We define a "human game" to be a game designed by humans for humans, and argue for the evaluative suitability of this space of all such games people can imagine and enjoy, the "Multiverse of Human Games". Taking a first step towards this vision, we introduce the AI GameStore, a scalable and open-ended platform that uses LLMs with humans-in-the-loop to synthesize new representative human games, by automatically sourcing and adapting standardized and containerized variants of game environments from popular human digital gaming platforms. As a proof of concept, we generated 100 such games based on the top charts of the Apple App Store and Steam, and evaluated seven frontier vision-language models (VLMs) on short episodes of play. The best models achieved less than 10% of the human average score on the majority of the games, and especially struggled with games that challenge world-model learning, memory and planning. We conclude with a set of next steps for building out the AI GameStore as a practical way to measure and drive progress toward human-like general intelligence in machines.

Paper Structure (28 sections, 1 equation, 14 figures, 2 tables)

Figures (14)

  • Figure 1: Many games are abstractions of real-world activities. They are inspired by diverse and concrete activities in human enterprise, and they prepare agents to adapt to similar problems that arise in the real-world.
  • Figure 2: Comparison of the different spaces of games discussed in the paper. The yellow rectangle represents the full space of all computable games. In the paper, we introduce the Multiverse of Human Games (orange), which collectively demands a large space of cognitive capabilities that are found in average humans. We argue that this space is a good proxy for human-like general intelligence. The space of all digital games on gaming platforms (green) covers a subset of that space. Among these digital games, the AI GameStore (blue) aims to sample from all digital games, but the initial 100 games cover only a small, restricted space.
  • Figure 3: The AI GameStore pipeline consists of four core stages: a) Game Sourcing and Selection: Popular games are harvested from digital marketplaces (Apple App Store and Steam) and filtered based on player ratings and reviews. An LLM then scores these candidates against specific suitability criteria, such as playability within minutes and the ability to produce quantifiable metrics, to identify the most viable games for adaptation. b) Game Generation and Refinement: Using game descriptions and requirements, an LLM generates an initial game (Version 0). This version undergoes automated refinement via simulated play and error-checking, followed by human-in-the-loop refinement, where human participants play the game and give feedback to improve it until it is fun and playable. This process generates a base game that corresponds to the original game, along with novel variants with modified or added mechanics. c) Game Annotations and Profiling: The final generated games are played by humans who annotate them based on a rubric of cognitive capabilities (e.g., planning, working memory, and reasoning). These annotations enable in-depth analysis of AI models' cognitive capabilities. d) Model Evaluation: AI models and human players interact with the games through a standardized interface. We then compute models' performance normalized against humans' and perform capability analysis.
  • Figure 4: Examples of popular digital games on the Apple App Store and their adapted AI GameStore versions. We present four example games, capturing diverse genres and cognitive capabilities. The top half of each example shows the original game paired with its corresponding version on the AI GameStore. The bottom half shows the annotations of the game's cognitive demands. The games and example play videos can be accessed at http://aigamestore.org. (VP = Visual Processing; ST = Spatial-temporal Coordination; ME = Memory; PL = Planning; WM = World Model Learning; PH = Physical Reasoning; SO = Social Reasoning.)
  • Figure 5: Performance score (top) and runtime comparison (bottom) between human players and VLMs on 100 games. We normalize all model scores against the human median score for each game (i.e., human median = 100) and then report the geometric mean of the normalized scores across the 100 games. The best-scoring model, GPT-5.2, reaches only 8.5 out of 100 on the human-relative scale. Additionally, humans play each game for 120 s, whereas models are significantly slower to complete 120 API calls, taking more than 10 times as long (> 1200 s) to finish most games and averaging around 12-18 times as long. Error bars indicate 95% confidence intervals.
  • ...and 9 more figures
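
The scoring described in the Figure 5 caption (per-game normalization against the human median, then a geometric mean across games) can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the example scores, and the `floor` clamp used to keep zero scores from collapsing the geometric mean are all assumptions, since the caption does not say how zero scores are handled.

```python
import math
from statistics import median


def human_normalized_score(model_score, human_scores):
    """Scale a model's raw score so that the human median on that game maps to 100."""
    human_median = median(human_scores)
    if human_median <= 0:
        raise ValueError("human median must be positive to normalize against it")
    return 100.0 * model_score / human_median


def geometric_mean(values, floor=1.0):
    """Geometric mean of per-game normalized scores.

    Values are clamped to `floor` before taking logs. The clamp is an
    assumption made for this sketch; the source does not specify one.
    """
    clamped = [max(v, floor) for v in values]
    return math.exp(sum(math.log(v) for v in clamped) / len(clamped))


# Hypothetical raw scores for two games (illustrative values only).
games = {
    "game_a": {"model": 5.0, "humans": [40.0, 50.0, 60.0]},
    "game_b": {"model": 20.0, "humans": [100.0, 200.0, 300.0]},
}

normalized = [
    human_normalized_score(g["model"], g["humans"]) for g in games.values()
]
overall = geometric_mean(normalized)
print(f"per-game normalized scores: {normalized}")
print(f"aggregate (geometric mean): {overall:.2f}")
```

A geometric mean penalizes near-zero performance on any single game more heavily than an arithmetic mean would, so a model cannot offset failures on many games with a high score on a few.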