MetaOthello: A Controlled Study of Multiple World Models in Transformers

Aviral Chawla, Galen Hall, Juniper Lovato

TL;DR

MetaOthello is a controlled suite of Othello variants with shared syntax but different rules or tokenizations. Transformers trained on mixed-game data do not partition their capacity into isolated sub-models; instead, they converge on a mostly shared board-state representation that transfers causally across variants.

Abstract

Foundation models must handle multiple generative processes, yet mechanistic interpretability largely studies capabilities in isolation; it remains unclear how a single transformer organizes multiple, potentially conflicting "world models". Previous experiments on Othello-playing neural networks test world-model learning but focus on a single game with a single set of rules. We introduce MetaOthello, a controlled suite of Othello variants with shared syntax but different rules or tokenizations, and train small GPTs on mixed-variant data to study how multiple world models are organized in a shared representation space. We find that transformers trained on mixed-game data do not partition their capacity into isolated sub-models; instead, they converge on a mostly shared board-state representation that transfers causally across variants. Linear probes trained on one variant can intervene on another's internal state with effectiveness approaching that of matched probes. For isomorphic games with token remapping, representations are equivalent up to a single orthogonal rotation that generalizes across layers. When rules partially overlap, early layers maintain game-agnostic representations while a middle layer identifies game identity, and later layers specialize. MetaOthello offers a path toward understanding not just whether transformers learn world models, but how they organize many at once.
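
To make the "equivalent up to a single orthogonal rotation" claim concrete, the following is a minimal sketch of orthogonal Procrustes alignment on synthetic data. It is not the authors' code: the names `orthogonal_procrustes`, `mean_cosine`, `W_a`, and `W_b` are hypothetical, the matrix sizes are arbitrary, and the toy weights are random stand-ins for real probe weights.

```python
import numpy as np

def orthogonal_procrustes(X, Y):
    """Return the orthogonal matrix Omega minimizing ||X @ Omega - Y||_F."""
    # Closed-form solution: SVD of the cross-covariance X^T Y.
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

def mean_cosine(A, B):
    """Mean cosine similarity between corresponding rows of A and B."""
    num = np.sum(A * B, axis=1)
    den = np.linalg.norm(A, axis=1) * np.linalg.norm(B, axis=1)
    return float(np.mean(num / den))

# Toy setup (assumption): variant B's directions are a hidden rotation
# of variant A's directions plus a little noise.
rng = np.random.default_rng(0)
d = 16                                          # toy hidden size
W_a = rng.normal(size=(64, d))                  # stand-in for variant A probe weights
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))    # hidden orthogonal map
W_b = W_a @ Q + 0.01 * rng.normal(size=(64, d))

omega = orthogonal_procrustes(W_a, W_b)
raw = mean_cosine(W_a, W_b)              # similarity before alignment
aligned = mean_cosine(W_a @ omega, W_b)  # similarity after alignment
print(f"raw={raw:.3f} aligned={aligned:.3f}")
```

In this toy case the raw similarity is small while the aligned similarity is close to 1, which is the qualitative pattern the abstract describes for isomorphic games with remapped tokens.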

Paper Structure (42 sections, 8 equations, 11 figures, 1 table)

Figures (11)

  • Figure 1: The MetaOthello Framework. (Left) We define a universe of games sharing a board size and vocabulary but differing in dynamics. (Middle) We sample sequences from these games. Early in the game, sequences are often valid under both rule sets (the "Ambiguous" branch), creating an informational conflict for the model. (Right) We train a small GPT model on these sequences and use Linear Probes on the residual stream to reconstruct the internal board representation.
  • Figure 2: Cosine similarity between board state probe weights in mixed models, with random baseline controls. Blue bars show raw cosine similarity; orange bars show similarity after per-layer Procrustes alignment. Black circles and squares indicate expected similarity for random probes (shuffled to preserve distribution) before and after alignment, respectively. For Classic vs. Iago (center), raw similarity matches the random raw baseline (0.03). However, after alignment, similarity reaches 0.98—substantially exceeding the random Procrustes baseline of 0.68. Error bars denote 95% CIs across 192 probe dimensions.
  • Figure 3: Global intervention error across conditions. Gray bars show null baseline (no intervention); blue bars show correct-probe intervention; orange bars show cross-probe intervention. Error bars denote 95% CI. We see that cross-probe intervention is nearly as effective as the correct probe in steering board states.
  • Figure 4: Classic-to-Iago activation alignment via orthogonal Procrustes. We feed Classic sequences to the mixed Classic-Iago model, apply a learned orthogonal rotation $\Omega$ to the residual stream at layer $l'$, and measure how well the model predicts corresponding Iago moves ($\alpha$ score).
  • Figure 5: Differentiation dynamics in the Classic-NoMidFlip mixed model. (a) Performance of NoMidFlip probes on tiles that differ between NoMidFlip and Classic after an ambiguous sequence $s^*$. (b) Probe fidelity: probes are trained to estimate the probability of the game context, $P(\text{Classic})$, with fidelity defined as $1 - |P_{\text{probe}} - P_{\text{GT}}|$. We also train a "Baseline" probe on one-hot encoded $60 \times 60$ (move $\times$ move number) inputs. Inset: average entropy of the ground-truth game distribution $P(g|s_{<t})$ over time. (c) Causal steering: we inject a game-steering vector ($\lambda[\mu_{\text{NoMid}} - \mu_{\text{Classic}}]$) and measure the change in adherence to NoMidFlip rules as the normalized increase in $\alpha$-score relative to NoMidFlip-valid moves.
  • ...and 6 more figures
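
The two quantities named in the Figure 5 caption, probe fidelity $1 - |P_{\text{probe}} - P_{\text{GT}}|$ and the steering vector $\lambda[\mu_{\text{NoMid}} - \mu_{\text{Classic}}]$, can be sketched as follows. This is an illustration under stated assumptions rather than the paper's implementation: it assumes each $\mu$ is the mean residual-stream activation over sequences from that game, and the function names and toy activations are hypothetical.

```python
import numpy as np

def probe_fidelity(p_probe, p_gt):
    """Fidelity 1 - |P_probe - P_GT|, computed elementwise."""
    return 1.0 - np.abs(np.asarray(p_probe) - np.asarray(p_gt))

def steering_vector(acts_nomid, acts_classic):
    """Difference of mean activations (assumed definition of mu_NoMid - mu_Classic)."""
    return acts_nomid.mean(axis=0) - acts_classic.mean(axis=0)

def steer(h, v, lam):
    """Add the scaled steering vector to a residual-stream activation h."""
    return h + lam * v

# Toy activations (assumption): two clusters separated by a constant shift.
rng = np.random.default_rng(0)
d_model = 8
acts_classic = rng.normal(size=(100, d_model))
acts_nomid = rng.normal(size=(100, d_model)) + 2.0

v = steering_vector(acts_nomid, acts_classic)
h = acts_classic[0]
h_steered = steer(h, v, lam=1.5)

# Fidelity for two positions: one slightly off, one exact.
fid = probe_fidelity([0.9, 0.5], [1.0, 0.5])
print(fid, h_steered.shape)
```

Setting `lam=0.0` leaves the activation unchanged, which gives a natural no-intervention baseline when measuring how much steering shifts the model's move predictions.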