Linear Latent World Models in Simple Transformers: A Case Study on Othello-GPT
Dean S. Hazineh, Zechen Zhang, Jeffery Chiu
TL;DR
This work investigates whether a simple transformer trained to play Othello develops an interpretable, linear world model. It shows that the board state can be linearly decoded from activations, and that this representation causally influences next-move decisions in the middle layers, with layer depth and model complexity shaping the strength of the effect. The authors introduce a causal-intervention method that perturbs layer-wise linear representations to test their role in prediction, revealing that the world model is largely formed in the middle layers and used to guide decisions, while late layers rely more on surface statistics. These findings advance mechanistic interpretability by demonstrating that even small transformers can harbor linear, causally relevant world representations, and they provide a framework for probing when, where, and how such representations contribute to decision-making.
Abstract
Foundation models exhibit significant capabilities in decision-making and logical deduction. Nonetheless, debate continues over whether they genuinely understand the world or merely perform stochastic mimicry. This paper examines a simple transformer trained to play Othello, extending prior research to better understand the emergent world model of Othello-GPT. The investigation reveals that Othello-GPT encodes a linear representation of opposing pieces, and that this representation causally steers its decision-making. This paper further characterizes the interplay between the linear world representation and causal decision-making, and how both depend on layer depth and model complexity. We have made the code public.
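To make the two ideas in the abstract concrete, the following is a minimal sketch of linear probing and a probe-guided intervention. It is not the authors' released code: the activations are synthetic stand-ins for one layer of a transformer, and every name and value here (`class_dirs`, `probe`, `intervene`, the three tile classes, the noise level, the step size) is a hypothetical choice made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical setup: one board tile with 3 states (0=empty, 1=mine, 2=theirs).
d_model, n_classes, n_samples = 64, 3, 2000

# Synthetic stand-in for layer activations: each tile state has its own
# direction in activation space, plus Gaussian noise (an assumption).
class_dirs = rng.normal(size=(n_classes, d_model))
labels = rng.integers(0, n_classes, size=n_samples)
acts = class_dirs[labels] + 0.5 * rng.normal(size=(n_samples, d_model))

# Linear probe: least-squares map from activations to one-hot tile states.
targets = np.eye(n_classes)[labels]
probe, *_ = np.linalg.lstsq(acts, targets, rcond=None)  # (d_model, n_classes)


def decode(x):
    """Return the tile state the linear probe reads out of activation x."""
    return int(np.argmax(x @ probe))


probe_acc = float(np.mean(np.argmax(acts @ probe, axis=1) == labels))


def intervene(x, target, step=0.5, max_steps=1000):
    """Shift x along probe directions until the probe decodes `target`."""
    x = x.copy()
    for _ in range(max_steps):
        current = decode(x)
        if current == target:
            break
        # Move toward the target class and away from the currently decoded one.
        direction = probe[:, target] - probe[:, current]
        x += step * direction / np.linalg.norm(direction)
    return x


x0 = acts[0]
new_target = (int(labels[0]) + 1) % n_classes
x_edit = intervene(x0, new_target)
print("probe accuracy:", probe_acc)
print("decoded before/after:", decode(x0), decode(x_edit))
```

In a real experiment, `acts` would be replaced by activations recorded from the model, and the edited activation would be written back into the forward pass to check whether the predicted next move changes accordingly. That last step is what would make the test causal rather than purely correlational, and it is not reproduced in this sketch.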
