Revisiting the Othello World Model Hypothesis

Yifei Yuan, Anders Søgaard

TL;DR

The paper investigates whether large language models induce world models by training seven architectures on Othello move sequences and evaluating 1-hop and 2-hop move generation. It combines cross-model representation alignment and latent move projection to test whether models learn shared, spatially structured board representations, finding up to 99% unsupervised grounding accuracy and high cross-model alignment (e.g., 93.1% in at least one synthetic pairing). The authors show that models converge on similar board-state representations and capture spatial relationships, reinforcing the Othello World Model Hypothesis beyond prior probing studies. These results have implications for understanding symbol grounding and structured reasoning in LLMs, suggesting that they can internalize dynamic environments from sequences of abstract actions.

Abstract

Li et al. (2023) used the Othello board game as a test case for the ability of GPT-2 to induce world models, and their work was followed up by Nanda et al. (2023b). We briefly discuss the original experiments, expanding them to include more language models with more comprehensive probing. Specifically, we analyze sequences of Othello board states and train each model to predict the next move based on previous moves. We evaluate seven language models (GPT-2, T5, Bart, Flan-T5, Mistral, LLaMA-2, and Qwen2.5) on the Othello task and conclude that these models not only learn to play Othello, but also induce the Othello board layout. We find that all models achieve up to 99% accuracy in unsupervised grounding and exhibit high similarity in the board features they learned. This provides considerably stronger evidence for the Othello World Model Hypothesis than previous work.
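The next-move prediction setup described in the abstract can be illustrated with a minimal sketch. It assumes a game is stored as a list of move tokens such as `"f5"`; the function name `next_move_examples` and the sample transcript are hypothetical and are not taken from the paper, whose actual tokenization and data pipeline may differ.

```python
from typing import List, Tuple


def next_move_examples(moves: List[str]) -> List[Tuple[List[str], str]]:
    """Turn one game transcript into (prefix, next move) training pairs.

    Each pair holds the moves played so far and the move that followed,
    which is the target the model is trained to predict.
    """
    return [(moves[:i], moves[i]) for i in range(1, len(moves))]


# Illustrative four-move transcript (an assumption, not data from the paper).
game = ["f5", "d6", "c3", "d3"]
examples = next_move_examples(game)

for prefix, target in examples:
    print(" ".join(prefix), "->", target)
```

A transcript of n moves yields n - 1 pairs, so the model only ever sees move tokens and never an explicit board state.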

Paper Structure

This paper contains 27 sections, 2 equations, 10 figures, 4 tables, 2 algorithms.

Figures (10)

  • Figure 1: Experimental protocol. We re-train the Transformer-based models to predict the next move in Othello and see whether the board game layout is induced (up to isomorphism).
  • Figure 2: Othello 1-hop generation error rate under different model sizes. All models are non-pretrained versions fine-tuned with 20k game sequences.
  • Figure 3: Analysis of 1-hop error rates on the SYNTHETIC dataset with varying data scales.
  • Figure 4: PCA visualization of the 60 steps from various models within one game.
  • Figure 5: Decoder feature similarity heatmap across different layers.
  • ...and 5 more figures