Evaluating the World Model Implicit in a Generative Model

Keyon Vafa, Justin Y. Chen, Ashesh Rambachan, Jon Kleinberg, Sendhil Mullainathan

TL;DR

The paper asks how to evaluate whether generative sequence models learn coherent world models, framing the problem with deterministic finite automata (DFAs) and the Myhill-Nerode theorem. It shows that standard next-token diagnostics can overstate world-model fidelity and introduces two model-agnostic metrics, sequence compression and sequence distinction, defined on the Myhill-Nerode boundary between states. Across NYC taxi routes, Othello, and logic puzzles, the authors show that models can score very well on existing diagnostics while their implied world models remain incoherent, which makes them fragile under detours or task shifts. The work provides a principled evaluation framework and a public benchmark, and notes the need to generalize beyond DFAs to capture richer structure in real-world domains.

Abstract

Recent work suggests that large language models may implicitly learn world models. How should we assess this possibility? We formalize this question for the case where the underlying reality is governed by a deterministic finite automaton. This includes problems as diverse as simple logical reasoning, geographic navigation, game-playing, and chemistry. We propose new evaluation metrics for world model recovery inspired by the classic Myhill-Nerode theorem from language theory. We illustrate their utility in three domains: game playing, logic puzzles, and navigation. In all domains, the generative models we consider do well on existing diagnostics for assessing world models, but our evaluation metrics reveal their world models to be far less coherent than they appear. Such incoherence creates fragility: using a generative model to solve related but subtly different tasks can lead to failures. Building generative models that meaningfully capture the underlying logic of the domains they model would be immensely valuable; our results suggest new ways to assess how close a given model is to that goal.

Paper Structure

This paper contains 24 sections, 2 theorems, 11 equations, 24 figures, 8 tables, 1 algorithm.

Key Result

Proposition 2.3

A generative model $m(\cdot)$ recovers the DFA $W$ if and only if it satisfies exact next-token prediction under the DFA $W$.
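
To make the condition in Proposition 2.3 concrete, here is a minimal Python sketch. It assumes that "exact next-token prediction" means the model gives positive probability to a next token exactly when the DFA allows that token. The toy automaton, the two stand-in models, and every function name are hypothetical and do not come from the paper. The check is bounded by `max_len`, so it can only falsify the property, not prove it for all sequences.

```python
from itertools import product

ALPHABET = ("a", "b")
# Hypothetical toy DFA: missing (state, token) pairs are invalid moves.
DELTA = {(0, "a"): 1, (0, "b"): 0, (1, "a"): 0}
START = 0

def run(seq, state=START):
    """Return the state reached by seq, or None if seq is invalid."""
    for tok in seq:
        state = DELTA.get((state, tok))
        if state is None:
            return None
    return state

def valid_next(seq):
    """Tokens the DFA allows after seq (empty if seq itself is invalid)."""
    state = run(seq)
    return {tok for tok in ALPHABET if (state, tok) in DELTA}

def exact_next_token_prediction(model_support, max_len):
    """For every valid sequence up to max_len, compare the model's set of
    positive-probability next tokens with the DFA's valid set."""
    for n in range(max_len + 1):
        for seq in product(ALPHABET, repeat=n):
            if run(seq) is None:
                continue  # only valid contexts are checked
            if model_support(seq) != valid_next(seq):
                return False
    return True

# Stand-in "models", each described only by its next-token support.
def faithful_model(seq):
    return valid_next(seq)

def sloppy_model(seq):
    # Always allows both tokens, so it admits invalid continuations.
    return set(ALPHABET)

print(exact_next_token_prediction(faithful_model, 6))
print(exact_next_token_prediction(sloppy_model, 6))
```

The sloppy model fails on the context `("a",)`, where the toy DFA allows only `"a"` but the model also allows `"b"`.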

Figures (24)

  • Figure 1: On the left, a visual depiction of a Myhill-Nerode boundary and interior. On the right, examples of two states for cumulative Connect-4. Both states have the same set of valid next moves. The shortest sequence in the Myhill-Nerode boundary has length 4, and the boundary contains sequences up to length 30. The interior contains approximately $8.8 \times 10^{27}$ sequences of length 29 that do not distinguish the two boards.
  • Figure 2: A visual depiction of our two evaluation metrics. A compression error is a model failing to recognize that two sequences that result in the same state should accept the same suffixes. A distinction error is a model failing to find the right distinguishing suffixes for two sequences that lead to different states. Our metrics measure errors at the boundary, which are visually depicted above.
  • Figure 3: Reconstructed maps of Manhattan from sequences produced by three models: the true world model (left), the true world model corrupted with noise (middle), and a transformer trained on random walks (right). Edges exit nodes in their specified cardinal direction. In the zoomed-in images, edges belonging to the true graph are black and false edges added by the reconstruction algorithm are red. We host interactive reconstructed maps from transformers at the following links: https://manhattan-reconstruction-shortest.netlify.app/, https://manhattan-reconstruction-noisy.netlify.app/, and https://manhattan-reconstruction-noisy.netlify.app/.
  • Figure 4: On the left, an example given to large language models to assess task capabilities. On the right, each model's average task performance along with their results on our proposed metrics. Models are very capable of solving logic puzzles despite not having a coherent world model.
  • Figure 5: Examples of data and traversals. On the left are examples of sequences seen during training and contexts used for evaluation. On the right is an example traversal generated by a transformer trained on shortest-paths data.
  • ...and 19 more figures
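
The boundary-based errors described in the captions of Figures 1 and 2 can be sketched in Python as follows. This is an illustrative reading, not the authors' implementation. It assumes the Myhill-Nerode boundary of two states is the set of minimal suffixes valid from the first state but not the second, with every proper prefix valid from both. It also assumes a model can be queried for whether it accepts a suffix after a given prefix. The toy transition tables, the stand-in model, and all function names are hypothetical, and the enumeration is truncated at `max_len`.

```python
ALPHABET = ("a", "b")
# Hypothetical toy world: missing (state, token) pairs are invalid moves.
TRUE_DELTA = {(0, "a"): 1, (0, "b"): 0, (1, "a"): 2, (1, "b"): 0, (2, "a"): 2}
# Stand-in "model": same table except it misreads "b" from state 0.
MODEL_DELTA = {**TRUE_DELTA, (0, "b"): 1}

def run(delta, seq, state=0):
    """State reached by seq under delta, or None if seq is invalid."""
    for tok in seq:
        state = delta.get((state, tok))
        if state is None:
            return None
    return state

def acceptor(delta, prefix):
    """Predicate: is this suffix valid after prefix, according to delta?"""
    start = run(delta, prefix)
    return lambda suffix: start is not None and run(delta, suffix, start) is not None

def boundary(accepts_1, accepts_2, max_len):
    """Minimal suffixes accepted by accepts_1 but not accepts_2, up to max_len."""
    found = set()
    frontier = [()]  # suffixes accepted by both sides (the interior)
    for _ in range(max_len):
        next_frontier = []
        for prefix in frontier:
            for tok in ALPHABET:
                s = prefix + (tok,)
                ok1, ok2 = accepts_1(s), accepts_2(s)
                if ok1 and ok2:
                    next_frontier.append(s)
                elif ok1:
                    found.add(s)
        frontier = next_frontier
    return found

def compression_error(model_delta, s1, s2, max_len):
    """s1 and s2 reach the same true state; any model boundary is an error."""
    a1, a2 = acceptor(model_delta, s1), acceptor(model_delta, s2)
    return bool(boundary(a1, a2, max_len) or boundary(a2, a1, max_len))

def distinction_scores(true_delta, model_delta, s1, s2, max_len):
    """Precision and recall of the model's boundary against the true one."""
    truth = boundary(acceptor(true_delta, s1), acceptor(true_delta, s2), max_len)
    guess = boundary(acceptor(model_delta, s1), acceptor(model_delta, s2), max_len)
    hits = len(truth & guess)
    precision = hits / len(guess) if guess else 1.0
    recall = hits / len(truth) if truth else 1.0
    return precision, recall

# () and ("b",) reach the same true state; () and ("a",) reach different ones.
print(compression_error(MODEL_DELTA, (), ("b",), 4))
print(distinction_scores(TRUE_DELTA, MODEL_DELTA, (), ("a",), 4))
```

In this toy setup the stand-in model treats `()` and `("b",)` as different states, which is a compression error. For `()` versus `("a",)` it recovers the true distinguishing suffix but also reports a spurious one, which lowers precision while recall stays at 1.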

Theorems & Definitions (8)

  • Definition 2.1
  • Proposition 2.3
  • Definition 2.4
  • Definition 2.5
  • Definition 2.6
  • Proof of Proposition 2.3
  • Definition C.1: Equivalent sequences
  • Theorem C.2: Myhill (1957); Nerode (1958)