Assessing Adaptive World Models in Machines with Novel Games

Lance Ying, Katherine M. Collins, Prafull Sharma, Cedric Colas, Kaiya Ivy Zhao, Adrian Weller, Zenna Tavares, Phillip Isola, Samuel J. Gershman, Jacob D. Andreas, Thomas L. Griffiths, Francois Chollet, Kelsey R. Allen, Joshua B. Tenenbaum

TL;DR

This work argues that human-like rapid adaptation hinges on adaptive world models and introduces world model induction as a core capacity. It proposes a novel-game benchmark framework grounded in cognitive-science principles, using hierarchical world-model induction and active exploration to assess how quickly and robustly AI systems build and revise internal environment models. The paper details desiderata for designing such games, outlines a generative framework to continually produce novel challenges, and presents metrics for evaluating both behavioral performance and the underlying world models. If adopted, this paradigm could drive progress toward AI with improved sample efficiency, generalization, and human-like adaptability, contributing to the quest for artificial general intelligence.

Abstract

Human intelligence exhibits a remarkable capacity for rapid adaptation and effective problem-solving in novel and unfamiliar contexts. We argue that this profound adaptability is fundamentally linked to the efficient construction and refinement of internal representations of the environment, commonly referred to as world models, and we refer to this adaptation mechanism as world model induction. However, current understanding and evaluation of world models in artificial intelligence (AI) remain narrow, often focusing on static representations learned from training on massive corpora of data, instead of the efficiency and efficacy of learning these representations through interaction and exploration within a novel environment. In this Perspective, we provide a view of world model induction drawing on decades of research in cognitive science on how humans learn and adapt so efficiently; we then call for a new evaluation framework for assessing adaptive world models in AI. Concretely, we propose a new benchmarking paradigm based on suites of carefully designed games with genuine, deep and continually refreshing novelty in the underlying game structures -- we refer to this class of games as novel games. We detail key desiderata for constructing these games and propose appropriate metrics to explicitly challenge and evaluate the agent's ability for rapid world model induction. We hope that this new evaluation framework will inspire future evaluation efforts on world models in AI and provide a crucial step towards developing AI systems capable of human-like rapid adaptation and robust generalization -- a critical component of artificial general intelligence.

Paper Structure

This paper contains 21 sections, 1 equation, 4 figures.

Figures (4)

  • Figure 1: Framework for characterizing world models across different levels of abstraction. a) World models within a hierarchical Bayesian framework. The structured probabilistic model $\Omega_1$ (ad-hoc world model) generates expectations about possible observations $e$, while abstract knowledge and principles (abstract world model) $\Omega_2, \Omega_3, \dots$ generate the space of possible structures for $\Omega_1$. Each level of abstraction generates hypotheses and probability distributions that support learning at the level below. Then, given observations $e$, the learner can update its world models by inverting the generative model. Figure adapted from tenenbaum2006theory. b) The hierarchical structure of games is analogous to many aspects of the human world model hierarchy. The world model learned at each game level can be ad-hoc and specific. For example, certain game-specific knowledge learned from the first game level (e.g., the effect of consuming a mushroom in Super Mario) can be generalized to the next level (within-domain generalization). On the other hand, meta-learning enables agents to learn domain-general principles at higher levels of abstraction. For example, mastering one platformer game can enable a player to quickly learn to play a different platformer game (cross-domain generalization).
  • Figure 2: Case studies of novel environments for testing world model induction. a) https://arcprize.org/arc-agi (arc2025v3) is an interactive reasoning benchmark. Players are not given instructions about the gameplay or win conditions; instead, they must infer game rules and objectives through interaction. In the featured example, players can move the tiles horizontally. The objective is to align the tiles so that the cells matching the standalone cell at the bottom (the yellow cell, in this case) are collinear. b) https://autumn.basis.ai/ (basis2025autumn) evaluates an agent's ability to discover latent mechanics through interactive exploration of grid-world environments. The evaluation follows a two-phase protocol: interaction and test. During interaction, agents explore environments freely without rewards or goals. The subsequent test phase evaluates their understanding through three tasks: masked frame prediction, defect detection, and planning. In the featured example, the agent is expected to discover through interaction the rule that adding a grey block lowers the balloon, which makes the right-most frame an anomaly. c) The games at https://pedrotsividis.com/vgdl-games/ (schaul2013video, perez2019general) were used by tsividis2021human to evaluate an agent's capability to discover rules and objectives in ambiguous environments when no language instructions are provided. In their games, the agent can press a few keys on a keyboard to explore the environment and pass each level. Humans generally learn to play these new games in a matter of minutes. d) https://sites.google.com/view/virtualtoolsgame (allen2020rapid) tests rapid learning in physical reasoning scenarios. The goal is to select and drop one of the tools on the right so that the red ball ends up in the green bin. allen2020rapid show that humans quickly solve these tasks by leveraging their intuitive understanding of physics. This showcases robust cross-domain generalization.
  • Figure 3: Evaluation methods used by tsividis2021human for comparing learning behaviors across humans and models (EMPA and DDQN). a) Quantitative learning efficiencies plotted as the number of levels passed as a function of exploration steps. b) Qualitative analysis of different exploration patterns demonstrated by different learning agents.
  • Figure 4: Creating novel games. There are many ways that researchers can create novel games. Variants could be sourced from existing games by modifying the environment and number of players, mechanics, or objective function. Novel games could also be formed by combining existing games.
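
The hierarchical Bayesian update described in the Figure 1 caption can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a hypothetical two-level toy problem: an abstract world model $\Omega_2$ with two candidate theories ("friendly" or "hostile" game genre), an ad-hoc world model $\Omega_1$ given by the probability that a mushroom helps the player, and binary observations $e$. All names, priors, and numbers below are illustrative assumptions. The learner inverts the generative model by enumerating hypotheses, using $P(\Omega_2, \Omega_1 \mid e) \propto P(e \mid \Omega_1)\, P(\Omega_1 \mid \Omega_2)\, P(\Omega_2)$.

```python
# Toy two-level hierarchical Bayesian inference (illustrative assumptions only).

# P(Omega_2): prior over abstract theories about the game genre (hypothetical).
prior_omega2 = {"friendly": 0.5, "hostile": 0.5}

# P(Omega_1 | Omega_2): each abstract theory generates a distribution over
# game-specific rules. Here Omega_1 is the probability that a mushroom helps.
prior_omega1_given_omega2 = {
    "friendly": {0.9: 0.8, 0.1: 0.2},
    "hostile": {0.9: 0.2, 0.1: 0.8},
}


def likelihood(observations, p_help):
    """P(e | Omega_1) for independent binary observations (1 = helped)."""
    result = 1.0
    for obs in observations:
        result *= p_help if obs == 1 else 1.0 - p_help
    return result


def posterior(observations):
    """Joint posterior over (Omega_2, Omega_1), computed by enumeration."""
    joint = {}
    for omega2, p2 in prior_omega2.items():
        for omega1, p1 in prior_omega1_given_omega2[omega2].items():
            joint[(omega2, omega1)] = p2 * p1 * likelihood(observations, omega1)
    total = sum(joint.values())
    return {key: value / total for key, value in joint.items()}


def marginal(joint, index):
    """Sum the joint posterior down to one level of the hierarchy."""
    out = {}
    for key, p in joint.items():
        out[key[index]] = out.get(key[index], 0.0) + p
    return out


# Three mushrooms in a row helped the player.
post = posterior([1, 1, 1])
post_omega2 = marginal(post, 0)  # belief about the abstract theory
post_omega1 = marginal(post, 1)  # belief about the game-specific rule
```

In this sketch the same three observations update both levels at once: the game-specific rule becomes nearly certain, while the abstract theory shifts more cautiously. The updated abstract belief could then serve as the prior for the next game, which is one way to read the within-domain and cross-domain generalization described in the caption.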