Linear Latent World Models in Simple Transformers: A Case Study on Othello-GPT

Dean S. Hazineh, Zechen Zhang, Jeffery Chiu

TL;DR

This work investigates whether a simple transformer trained to play Othello develops an interpretable, linear world model. It shows that the board state can be linearly decoded from activations and that this representation causally influences next-move decisions at middle layers, with layer depth and model complexity shaping the strength of the effect. The authors introduce a causal-intervention method that perturbs layer-wise linear representations to test their role in prediction; the results indicate that the world model is largely formed in the middle layers and used to guide decisions, while late layers rely more on surface statistics. These findings advance mechanistic interpretability by demonstrating that even small transformers can harbor linear, causally relevant world representations, and they provide a framework for probing when, where, and how such representations contribute to decision-making.
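
The two ideas in this summary, linear decoding of a tile state and a causal edit along the probe direction, can be illustrated with a minimal sketch. This is not the authors' code: the activations are synthetic, the three tile states (empty, mine, theirs) are an assumed labeling, and the names `fit_linear_probe` and `intervene` are hypothetical helpers. A ridge-regularized least-squares probe stands in for whatever training procedure the paper uses.

```python
import numpy as np

N_CLASSES = 3  # assumed tile states: 0 = empty, 1 = mine, 2 = theirs

def add_bias(acts):
    """Append a constant column so the probe can learn an offset."""
    return np.hstack([acts, np.ones((acts.shape[0], 1))])

def fit_linear_probe(acts, labels, ridge=1e-3):
    """Least-squares probe from activations to one-hot tile states."""
    X = add_bias(acts)
    Y = np.eye(N_CLASSES)[labels]
    return np.linalg.solve(X.T @ X + ridge * np.eye(X.shape[1]), X.T @ Y)

def probe_scores(W, acts):
    """Linear probe scores, one column per tile state."""
    return add_bias(acts) @ W

def intervene(act, W, target, margin=0.05, max_iters=20):
    """Shift one activation along probe directions until the probe decodes `target`."""
    x = act.copy()
    for _ in range(max_iters):
        scores = probe_scores(W, x[None, :])[0]
        current = int(scores.argmax())
        if current == target:
            break
        # Direction that raises the target score relative to the current winner.
        direction = W[:-1, target] - W[:-1, current]
        gap = scores[current] - scores[target] + margin
        x = x + gap * direction / (direction @ direction)
    return x

# Synthetic stand-in for layer activations (an assumption, not real model data).
rng = np.random.default_rng(0)
d_model, n_samples = 16, 600
class_means = 2.0 * rng.normal(size=(N_CLASSES, d_model))
labels = rng.integers(0, N_CLASSES, size=n_samples)
acts = class_means[labels] + 0.5 * rng.normal(size=(n_samples, d_model))

# Fit on the first 400 samples, evaluate on the rest.
W = fit_linear_probe(acts[:400], labels[:400])
held_out_pred = probe_scores(W, acts[400:]).argmax(axis=1)
accuracy = float((held_out_pred == labels[400:]).mean())

# Edit one held-out activation so the probe reads a different tile state.
original_state = int(labels[400])
target_state = (original_state + 1) % N_CLASSES
edited = intervene(acts[400], W, target_state)
decoded_state = int(probe_scores(W, edited[None, :])[0].argmax())

print(f"probe accuracy: {accuracy:.3f}")
print(f"tile state {original_state} -> decoded {decoded_state} after intervention")
```

In a real experiment the edited activation would be written back into the chosen layer and the forward pass resumed, so that the change in next-move logits can be measured; the sketch stops at checking that the probe itself reads the new state.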

Abstract

Foundation models exhibit significant capabilities in decision-making and logical deductions. Nonetheless, a continuing discourse persists regarding their genuine understanding of the world as opposed to mere stochastic mimicry. This paper meticulously examines a simple transformer trained for Othello, extending prior research to enhance comprehension of the emergent world model of Othello-GPT. The investigation reveals that Othello-GPT encapsulates a linear representation of opposing pieces, a factor that causally steers its decision-making process. This paper further elucidates the interplay between the linear world representation and causal decision-making, and their dependence on layer depth and model complexity. We have made the code public.

Paper Structure (17 sections, 1 equation, 11 figures, 2 tables)

Figures (11)

  • Figure 1: Overview of the principles in the world-representation section. (a) The neural architecture used in this paper, where the number of layers refers to the number of transformer blocks. (b) The original intervention scheme of Li et al. (2023) is replaced by the alternate version shown in (c), in which the intervention is applied to a single layer (see text for details). (Bottom panel) Example of an extracted world representation, where the game board state is obtained from the activation vectors.
  • Figure 2: Example attention heads of the 8L8H model. Both heads show alternating patterns, which we interpret as processing information from pieces first placed by the same player. For example, Head 3 tracks all of the current player's past moves and Head 4 tracks all of the opponent's past moves.
  • Figure 3: Latent saliency map for a particular game, with interventions at layer 6. The move $m^*$ for which we calculate logits is highlighted with a red box, and the numbers indicate the logit change produced by an intervention at each tile. If a possible next move is currently illegal, interventions that would make the move legal produce a positive change (bright yellow); conversely, interventions that make a currently legal move illegal produce a negative change (dark blue).
  • Figure 4: Comparison of logit distributions with causal interventions. (a) Logit distribution for move option 1 at a particular game length (unplayed board). (b) Logit distribution for move option 2 pre- and post-intervention. The post-intervention logit distribution resembles that of move option 1. (c) Cosine similarities between the two options pre- and post-intervention, averaged over 50 sample games. The difference between the two cosine similarities shows when and where intervention is most influential.
  • Figure 5: (First row) Losses for various sequential models trained to play legal moves in Othello. The average percentages of legal next moves played on a validation set are [94.9$\%$, 98.6$\%$, 99.7$\%$, 99.6$\%$, 99.9$\%$] for the models in the legend. (Bottom rows) Linear probe accuracy for different models at each layer.
  • ...and 6 more figures
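
The comparison described in the Figure 4(c) caption can be sketched as a cosine similarity between two logit vectors, computed before and after an intervention. The snippet below is a minimal illustration under stated assumptions: the three-entry logit vectors are made-up toy values, and `intervention_effect` is a hypothetical helper name, not a function from the paper's code.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two logit vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def intervention_effect(logits_ref, logits_pre, logits_post):
    """Similarity to the reference option before and after intervening, plus the gain."""
    pre = cosine_similarity(logits_ref, logits_pre)
    post = cosine_similarity(logits_ref, logits_post)
    return pre, post, post - pre

# Toy logits (illustrative values only, not results from the paper).
option_1 = [2.0, 0.0, -1.0]          # reference distribution (move option 1)
option_2_pre = [0.0, 2.0, -1.0]      # move option 2 before intervention
option_2_post = [1.8, 0.2, -1.0]     # move option 2 after intervention

pre, post, gain = intervention_effect(option_1, option_2_pre, option_2_post)
print(f"pre={pre:.3f} post={post:.3f} gain={gain:.3f}")
```

Averaging the gain over many sample games, per layer and per game length, would give a curve of the kind the caption describes; a larger gain marks where the intervention matters most.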