Emergent World Models and Latent Variable Estimation in Chess-Playing Language Models

Adam Karvonen

TL;DR

This work extends the investigation of emergent world models in language models from synthetic Othello to real chess by training GPT-style models on million-scale chess transcripts. It shows that linear probes can recover internal board-state representations and that latent variables such as player skill (Elo) are encoded and usable to improve predictive performance. Through causal interventions on model activations, the authors demonstrate that editing the internal board state and manipulating skill can meaningfully alter playing behavior, including substantial win-rate gains on challenging setups. The results offer a concrete, interpretable view of how world models and latent concepts emerge in constrained domains and point to practical intervention techniques for steering LLMs in structured tasks.

Abstract

Language models have shown unprecedented capabilities, sparking debate over the source of their performance. Is it merely the outcome of learning syntactic patterns and surface-level statistics, or do they extract semantics and a world model from the text? Prior work by Li et al. investigated this by training a GPT model on synthetic, randomly generated Othello games and found that the model learned an internal representation of the board state. We extend this work into the more complex domain of chess, training on real games and investigating our model's internal representations using linear probes and contrastive activations. The model is given no a priori knowledge of the game and is trained solely on next-character prediction, yet we find evidence of internal representations of board state. We validate these internal representations by using them to make interventions on the model's activations and edit its internal board state. Unlike Li et al.'s prior synthetic dataset approach, our analysis finds that the model also learns to estimate latent variables like player skill to better predict the next character. We derive a player skill vector and add it to the model, improving the model's win rate by up to 2.6 times.
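The abstract's "player skill vector" derived from contrastive activations can be illustrated with a minimal sketch. It assumes the vector is the difference between mean residual-stream activations on high-skill and low-skill games; the abstract does not spell out the exact procedure, so this is only one plausible reading. The function names, the scale parameter, and the synthetic activations are hypothetical stand-ins, and the width of 512 is borrowed from the probe vector size mentioned in the Figure 4 caption.

```python
import numpy as np

def skill_vector(high_acts: np.ndarray, low_acts: np.ndarray) -> np.ndarray:
    """Contrastive activation: mean high-skill activation minus mean low-skill activation."""
    return high_acts.mean(axis=0) - low_acts.mean(axis=0)

def add_skill(residual: np.ndarray, vector: np.ndarray, scale: float = 1.0) -> np.ndarray:
    """Add the scaled skill vector to every position of a (seq_len, d_model) residual stream."""
    return residual + scale * vector

# Synthetic stand-ins for activations collected at one layer (assumption, not real data).
rng = np.random.default_rng(0)
d_model = 512
direction = rng.normal(size=d_model)                         # hidden "skill" direction
high = rng.normal(size=(200, d_model)) + 0.5 * direction     # activations on high-skill games
low = rng.normal(size=(200, d_model)) - 0.5 * direction      # activations on low-skill games

v = skill_vector(high, low)
residual = rng.normal(size=(10, d_model))                    # residual stream for 10 positions
steered = add_skill(residual, v, scale=2.0)
```

In a real model the addition would happen inside the forward pass at a chosen layer, and the scale would be tuned empirically; both choices are left open here.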

Paper Structure (19 sections, 4 equations, 5 figures, 9 tables)


Figures (5)

  • Figure 1: Heat maps of the model's internal board state derived from the probe outputs, which have been trained on a one-hot classification objective. The probes output log probabilities for the 13 different piece types at every square, which we can use to construct a heat map for any piece type. The left heat maps display ground truth piece locations. The right heat maps display a gradient of model confidence on piece locations. To view a more binary heat map, we can clip these values to be between -2 and 0, which can be seen in the center heat map. The model has reasonable representations. It is very confident that the black king is not on the white side of the board.
  • Figure 2: Model win rate versus a range of Stockfish 16 levels. The 16-layer and 8-layer models are trained on identical datasets for an identical number of epochs. The 16-layer model consistently has a higher win rate than the 8-layer model. While precise Elo measurements for Stockfish are complex, a reasonable approximation for level 0 is around 1300 Elo. Details are given in the appendix on Stockfish evaluations.
  • Figure 3: We test linear probes for player Elo classification and board square state classification on every layer of each model. The 8-layer model computes an accurate board state by layer 6, yet the 16-layer model doesn't obtain similar accuracy until layer 12. Oddly, the skill probes trained on randomized models become more accurate on deeper layers.
  • Figure 4: Board State Intervention Process. We first sample the model's next move prediction, identifying a strategically relevant piece the model intends to move (e.g., white pawn from C2 to C3). We then delete this piece from both the original board and the model's internal representation by subtracting the corresponding vector (the 512-dimensional "my pawn" vector from the C2 square linear probe) from the model's residual stream. Using the unmodified PGN string input, we generate 5 new moves from the modified model. A successful intervention results in all moves being legal under the hypothetical modified board state, despite the model receiving no explicit information about the piece removal.
  • Figure 5: In this intervention, we delete the C2 pawn described in Figure 4. After the intervention, the C2 pawn has been erased from the model's internal representation. However, the other pawns are less distinct, indicating that the intervention has had unintended side effects.
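The probe readout in Figure 1 and the vector subtraction in Figure 4 can be sketched together in a few lines. This is a rough illustration under stated assumptions, not the paper's implementation: the probe is modeled as one weight tensor of shape (64 squares, 13 classes, 512 dimensions), squares are indexed with a1 = 0 so that C2 = 10, and the class index used for "my pawn" is hypothetical. The weights and the residual vector are random stand-ins.

```python
import numpy as np

N_SQUARES, N_CLASSES, D_MODEL = 64, 13, 512

def probe_log_probs(resid: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Linear probe: (D_MODEL,) residual vector -> (64, 13) log probabilities per square."""
    logits = W @ resid                                   # (64, 13, 512) @ (512,) -> (64, 13)
    logits = logits - logits.max(axis=-1, keepdims=True) # stabilise the log-softmax
    return logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))

def heatmap(log_probs: np.ndarray, piece: int, lo: float = -2.0, hi: float = 0.0) -> np.ndarray:
    """8x8 heat map for one piece class, clipped to [lo, hi] for a more binary view."""
    return np.clip(log_probs[:, piece], lo, hi).reshape(8, 8)

def delete_piece(resid: np.ndarray, W: np.ndarray, square: int, piece: int,
                 scale: float = 1.0) -> np.ndarray:
    """Subtract the probe's (square, piece) direction from the residual vector."""
    return resid - scale * W[square, piece]

# Random stand-ins for trained probe weights and a model activation (assumption).
rng = np.random.default_rng(0)
W = rng.normal(scale=0.05, size=(N_SQUARES, N_CLASSES, D_MODEL))
resid = rng.normal(size=D_MODEL)

C2 = 10        # assumed indexing: a1 = 0, b1 = 1, ..., so c2 = 1 * 8 + 2
MY_PAWN = 1    # hypothetical class index for "my pawn"

before = probe_log_probs(resid, W)
edited = delete_piece(resid, W, C2, MY_PAWN)
after = probe_log_probs(edited, W)
hm = heatmap(after, MY_PAWN)
```

Subtracting the probe direction lowers the probe's confidence that the pawn sits on C2, but because probe directions for different squares are not orthogonal in general, the same edit also shifts the readout at other squares, which is consistent with the side effects noted in the Figure 5 caption.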