
Code World Models for General Game Playing

Wolfgang Lehrach, Daniel Hennes, Miguel Lazaro-Gredilla, Xinghua Lou, Carter Wendelken, Zun Li, Antoine Dedieu, Jordi Grau-Moya, Marc Lanctot, Atil Iscen, John Schultz, Marcus Chiam, Ian Gemp, Piotr Zielinski, Satinder Singh, Kevin P. Murphy

TL;DR

The paper addresses the limitations of LLM-based policies in game playing by introducing Code World Models (CWMs): executable Python simulations that an LLM synthesizes from game rules and trajectories. A CWM exposes the rules through a planner-friendly interface (MCTS for perfect information games, ISMCTS for imperfect information games), and planning is augmented with LLM-generated heuristic value functions and inference functions, so planning-driven play remains possible under partial observability. Across 10 two-player games, including novel out-of-distribution (OOD) variants, CWMs match or outperform a strong policy-based baseline (Gemini 2.5 Pro) in 9 of the 10 games and rival a ground-truth planner in many cases, with notable gains in perfect-information settings and robust performance under partial observability. By separating data-to-code translation from policy generation, the approach offers verifiability, strategic depth, and generalization, and suggests a scalable path to general game playing that combines LLMs with classical planners. The results also expose current limits (notably Gin Rummy in closed-deck mode) and point to future work on online learning and open-world tasks.

Abstract

The reasoning abilities of Large Language Models (LLMs) are increasingly being applied to classical board and card games, but the dominant approach -- prompting for direct move generation -- has significant drawbacks. It relies on the model's implicit, fragile pattern-matching capabilities, leading to frequent illegal moves and strategically shallow play. Here we introduce an alternative approach: we use the LLM to translate natural language rules and game trajectories into a formal, executable world model represented as Python code. This generated model -- comprising functions for state transition, legal move enumeration, and termination checks -- serves as a verifiable simulation engine for high-performance planning algorithms like Monte Carlo tree search (MCTS). In addition, we prompt the LLM to generate heuristic value functions (to make MCTS more efficient) and inference functions (to estimate hidden states in imperfect information games). Our method offers three distinct advantages compared to directly using the LLM as a policy: (1) Verifiability: the generated CWM serves as a formal specification of the game's rules, allowing planners to algorithmically enumerate valid actions and avoid illegal moves, contingent on the correctness of the synthesized model; (2) Strategic Depth: we combine LLM semantic understanding with the deep search power of classical planners; and (3) Generalization: we direct the LLM to focus on the meta-task of data-to-code translation, enabling it to adapt to new games more easily. We evaluate our agent on 10 different games, of which 4 are novel and created for this paper. Five of the games are fully observed (perfect information), and five are partially observed (imperfect information). We find that our method outperforms or matches Gemini 2.5 Pro in 9 out of the 10 considered games.
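
To make the three components named in the abstract concrete (state transition, legal move enumeration, termination check), the following is a minimal hand-written sketch of what a code world model might look like for a toy take-away game. It is not taken from the paper: the game, the `State` class, and every function name here are illustrative assumptions, and the planner is plain flat Monte Carlo with random rollouts, used as a simpler stand-in for MCTS.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class State:
    """Hypothetical toy game: players alternate removing 1-3 stones;
    whoever takes the last stone wins."""
    stones: int  # stones left in the pile
    player: int  # index (0 or 1) of the player to move


def legal_actions(state):
    """Legal move enumeration: remove 1, 2, or 3 stones, never more than remain."""
    return list(range(1, min(3, state.stones) + 1))


def apply_action(state, action):
    """State transition: remove the stones and pass the turn."""
    if action not in legal_actions(state):
        raise ValueError(f"illegal action {action} in {state}")
    return State(state.stones - action, 1 - state.player)


def is_terminal(state):
    """Termination check: the game ends when the pile is empty."""
    return state.stones == 0


def returns(state):
    """Payoffs (player 0, player 1) at a terminal state."""
    winner = 1 - state.player  # the player who just moved took the last stone
    return (1.0, -1.0) if winner == 0 else (-1.0, 1.0)


def rollout(state, rng):
    """Play uniformly random legal moves until the game ends."""
    while not is_terminal(state):
        state = apply_action(state, rng.choice(legal_actions(state)))
    return returns(state)


def plan(state, rollouts_per_action=500, seed=0):
    """Flat Monte Carlo planner (a simplification, not MCTS): score each
    legal action by the mean payoff of random rollouts and pick the best."""
    rng = random.Random(seed)
    me = state.player
    best_action, best_value = None, float("-inf")
    for action in legal_actions(state):
        child = apply_action(state, action)
        total = sum(rollout(child, rng)[me] for _ in range(rollouts_per_action))
        value = total / rollouts_per_action
        if value > best_value:
            best_action, best_value = action, value
    return best_action
```

Because the planner only ever picks from `legal_actions(state)`, it cannot propose an illegal move with respect to the model; whether those moves are legal in the real game depends on the model being correct, which is the caveat the abstract attaches to verifiability.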

Paper Structure

This paper contains 67 sections, 9 figures, 24 tables.

Figures (9)

  • Figure 1: Evolution of the transition and inference accuracy with the number of LLM calls for imperfect information games, with refinement via tree search and hidden history inference.
  • Figure 2: W/L/D rates for game play between CWM-MCTS and three opponents. CWMs are refined via tree search and hidden history inference.
  • Figure 3: W/L/D rates and payoff distributions for game play between CWM-ISMCTS and three opponents. CWMs are refined via tree search and hidden history inference, open deck.
  • Figure 4: W/L/D rates and payoff distributions for game play between CWM-ISMCTS and three opponents. CWMs are refined via tree search and hidden history inference, closed deck.
  • Figure 5: ISMCTS. A search tree is built over possible ground truth histories (e.g. $h_1$, $h_2$, …). Because the player cannot distinguish between certain histories, statistics are aggregated at the level of information sets (dotted boxes), which group all histories that appear identical to the player.
  • ...and 4 more figures
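
The Figure 5 caption describes statistics being aggregated per information set rather than per ground-truth history. The sketch below illustrates only that bookkeeping step, under stated assumptions: the history encoding (a tuple of private cards plus a tuple of public actions), `info_state_key`, and `InfoSetStats` are hypothetical names invented here, and the payoff values are arbitrary. It is not the paper's ISMCTS implementation.

```python
from collections import defaultdict


def info_state_key(history, player):
    """Project a full history onto what `player` can observe:
    their own private card and the public action sequence.
    Assumed encoding: history = (private_cards, public_actions)."""
    private_cards, public_actions = history
    return (player, private_cards[player], public_actions)


class InfoSetStats:
    """Visit counts and value sums keyed by (information set, action),
    so histories the player cannot tell apart share one entry."""

    def __init__(self):
        self.visits = defaultdict(int)
        self.total_value = defaultdict(float)

    def update(self, history, player, action, value):
        key = (info_state_key(history, player), action)
        self.visits[key] += 1
        self.total_value[key] += value

    def mean_value(self, history, player, action):
        key = (info_state_key(history, player), action)
        n = self.visits.get(key, 0)
        return self.total_value.get(key, 0.0) / n if n else 0.0


# Two ground-truth histories that differ only in the opponent's hidden card.
h1 = (("K", "Q"), ())
h2 = (("K", "J"), ())

stats = InfoSetStats()
stats.update(h1, player=0, action="bet", value=1.0)  # illustrative payoff
stats.update(h2, player=0, action="bet", value=2.0)  # illustrative payoff
```

From player 0's point of view `h1` and `h2` map to the same key, so both updates land in one entry; from player 1's point of view they stay distinct, because player 1 sees a different private card in each.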