Evidence of Learned Look-Ahead in a Chess-Playing Neural Network

Erik Jenner; Shreyas Kapur; Vasil Georgiev; Cameron Allen; Scott Emmons; Stuart Russell

TL;DR

The paper asks whether a chess-playing neural network learns look-ahead rather than relying solely on heuristics. It analyzes the transformer-based policy network of Leela Chess Zero using activation patching, attention-head ablations, and a bilinear probe, and shows that representations of future moves, especially the target square of the 3rd move, causally influence the current decision. A key finding is that a bilinear probe predicts the optimal move two turns ahead (the 3rd move) with 92% accuracy in board states where Leela finds a single best line, which provides an existence proof of learned look-ahead in a real-world model. The work also identifies attention heads that move information between the squares of future moves and the squares of earlier ones, and discusses limitations and possible generalizations to other domains.
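To make the activation-patching idea concrete, here is a minimal Python sketch on a toy model. Everything in it is hypothetical: the four-square "board", `toy_policy`, `READOUT`, and the example inputs are illustrative stand-ins, not Leela's architecture or the authors' code. It only shows the mechanics: run the model on a clean input, overwrite one square's intermediate activation with the value from a corrupted run, and measure how much the log odds of the correct move drop.

```python
import math

# Hypothetical readout weights: one row per move, one column per square.
READOUT = [
    [2.0, 0.0, 3.0, 0.0],  # move 0 (treated as the correct move below)
    [0.0, 1.0, 0.0, 0.0],  # move 1
    [0.0, 0.0, 0.0, 1.0],  # move 2
]

def toy_policy(board, patch=None):
    """Toy policy: one scalar activation per square, then a softmax over moves.

    `patch` is an optional (square_index, value) pair that overwrites one
    intermediate activation before the readout.
    """
    acts = [math.tanh(x) for x in board]  # stand-in for per-square activations
    if patch is not None:
        square, value = patch
        acts[square] = value
    logits = [sum(w * a for w, a in zip(row, acts)) for row in READOUT]
    z = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - z) for l in logits]
    total = sum(exps)
    return [e / total for e in exps], acts

def log_odds(p):
    return math.log(p / (1.0 - p))

def patching_effects(clean, corrupted, correct_move):
    """Log odds reduction of the correct move when patching each square in turn."""
    p_clean, _ = toy_policy(clean)
    _, corrupted_acts = toy_policy(corrupted)
    baseline = log_odds(p_clean[correct_move])
    effects = []
    for square in range(len(clean)):
        p_patched, _ = toy_policy(clean, patch=(square, corrupted_acts[square]))
        effects.append(baseline - log_odds(p_patched[correct_move]))
    return effects

# The two toy boards differ only on square 2.
CLEAN = [1.0, 0.2, 1.5, 0.1]
CORRUPTED = [1.0, 0.2, -1.5, 0.1]
effects = patching_effects(CLEAN, CORRUPTED, correct_move=0)
```

In this toy setting only the patch on square 2 changes the output; squares whose input is identical in both boards give an effect of exactly zero, because the patched value equals the clean activation.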

Abstract

Do neural networks learn to implement algorithms such as look-ahead or search "in the wild"? Or do they rely purely on collections of simple heuristics? We present evidence of learned look-ahead in the policy network of Leela Chess Zero, the currently strongest neural chess engine. We find that Leela internally represents future optimal moves and that these representations are crucial for its final output in certain board states. Concretely, we exploit the fact that Leela is a transformer that treats every chessboard square like a token in language models, and give three lines of evidence: (1) activations on certain squares of future moves are unusually important causally; (2) we find attention heads that move important information "forward and backward in time," e.g., from squares of future moves to squares of earlier ones; and (3) we train a simple probe that can predict the optimal move 2 turns ahead with 92% accuracy (in board states where Leela finds a single best line). These findings are an existence proof of learned look-ahead in neural networks and might be a step towards a better understanding of their capabilities.
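As a rough illustration of what a bilinear probe over per-square activations could look like, the sketch below scores each candidate square `t` against a query square `q` with `h_q^T W h_t` and predicts the highest-scoring square. This exact form is an assumption; the text above only says the probe is simple, and the probe's inputs, training procedure, and the role of the query square are not specified here. The names `bilinear_score`, `predict_square`, `accuracy`, the hand-set matrix `W`, and the toy activations are all hypothetical. A real probe would learn `W` from data.

```python
def bilinear_score(h_a, W, h_b):
    """Return h_a^T W h_b for vectors and a matrix given as plain Python lists."""
    return sum(
        h_a[i] * W[i][j] * h_b[j]
        for i in range(len(h_a))
        for j in range(len(h_b))
    )

def predict_square(acts, W, query_square):
    """Predict the square whose activation scores highest against the query square."""
    scores = [bilinear_score(acts[query_square], W, h) for h in acts]
    return max(range(len(acts)), key=scores.__getitem__)

def accuracy(examples, W):
    """Fraction of (activations, query_square, label) examples predicted correctly."""
    hits = sum(predict_square(acts, W, q) == label for acts, q, label in examples)
    return hits / len(examples)

# Hand-set toy weights (a trained probe would fit these): score = h_a[0] * h_b[1].
W = [[0.0, 1.0],
     [0.0, 0.0]]

# Three toy "boards" with 3 squares and 2-dimensional activations each.
EXAMPLES = [
    ([[1.0, 0.0], [0.0, 2.0], [0.0, 0.5]], 0, 1),  # predicted square 1, correct
    ([[0.0, 0.1], [1.0, 0.0], [0.0, 3.0]], 1, 2),  # predicted square 2, correct
    ([[1.0, 0.0], [0.0, 1.0], [0.0, 2.0]], 0, 1),  # predicted square 2, wrong
]

probe_accuracy = accuracy(EXAMPLES, W)
```

The accuracy computed here is just the fraction of toy examples where the arg-max square matches the label; it has no connection to the 92% figure reported for the real model.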

Paper Structure (38 sections, 13 figures)

Introduction
Experimental Setup
Leela Chess Zero
Puzzle dataset
Activation patching
Results
Activations on future move squares are unusually important
Attention heads move information forward and backward in time
L12H12 moves information "backward in time"
"Piece movement heads" help analyze consequences of future moves
Simple probes can predict future moves
Related Work
Chess-playing neural networks
Learned look-ahead and search
(Mechanistic) Interpretability
...and 23 more sections

Figures (13)

Figure 1: Activation patching lets us study where important information is stored in Leela. Here, we patch an activation in one particular square and layer from the forward pass on a "corrupted" board state (bottom) into the forward pass on a "clean" state (top). Each row in the network corresponds to one chessboard square, which Leela treats like a token in a language model. The intervention drastically affects Leela's output (right), telling us that the activation on the patched square stores information necessary for Leela's performance in this state. Only patching on specific squares has significant effects. See https://leela-interp.github.io/ for more (animated) examples.
Figure 2: Top row: An example of the puzzles we use. It is white's turn in the starting state, and the only winning action is to move the knight to g6. Black's only response is taking the knight with the pawn; then white checkmates by moving the rook to h4. We will see the colored squares again: the target square of the 1st move in this principal variation (green) and the target square of the 3rd move (blue). Below: Leela receives each state as a separate input and computes a policy in that state.
Figure 3: Results from activation patching in the residual stream. The top row shows results in a single example state at three select layers. Darker squares correspond to larger effects from intervening on that square. In the early layer, the effect is strongest when patching on the corrupted square h6, then in middle layers, the 3rd move target square h4 becomes important, and finally the 1st move target square g6 dominates in late layers. The line plot below shows mean effects over the entire dataset, demonstrating that this pattern holds beyond just this example. The "other squares" line is the maximum effect over all 61 other squares (where the maximum is taken per board state and then averaged). Error bars are two times the standard error of the mean.
Figure 4: Mean log odds reduction from activation patching of attention head outputs, one head at a time. The head that stands out the most is L12H12.
Figure 5: Zero-ablations in the attention pattern in L12H12. Green line: ablation of the attention entry with key on the 3rd and query on the 1st target square. Gray line: ablation of all 4095 other entries at once. The lines show the effect at a given percentile of puzzles sorted by effect size. Error bars are 95% CIs; see the section on errors in the paper.
...and 8 more figures
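The zero-ablation described in the Figure 5 caption can be sketched on a toy attention pattern. The code below is a hypothetical illustration, not the authors' implementation: the helper names, the 3-square pattern, and the value vectors are made up, and the choice to zero an entry without renormalizing the row is an assumption. It compares ablating a single (query, key) entry with ablating every other entry, and checks that a 64-by-64 pattern has 4095 entries besides the chosen one.

```python
def attention_output(pattern, values):
    """Per-query output: weighted sum of value vectors under the attention pattern."""
    dim = len(values[0])
    return [
        [sum(row[k] * values[k][d] for k in range(len(values))) for d in range(dim)]
        for row in pattern
    ]

def zero_ablate(pattern, entries):
    """Copy `pattern` and set the listed (query, key) entries to zero.

    Rows are not renormalized afterwards (an assumption of this sketch).
    """
    ablated = [row[:] for row in pattern]
    for q, k in entries:
        ablated[q][k] = 0.0
    return ablated

def all_other_entries(n, keep):
    """All (query, key) pairs of an n-by-n pattern except `keep`."""
    return [(q, k) for q in range(n) for k in range(n) if (q, k) != keep]

# Toy pattern over 3 squares: pattern[q][k] is the weight query q puts on key k.
PATTERN = [
    [0.1, 0.1, 0.8],
    [0.5, 0.5, 0.0],
    [0.2, 0.3, 0.5],
]
VALUES = [[1.0], [2.0], [10.0]]

# Pretend square 0 is the 1st-move target (query) and square 2 the 3rd-move target (key).
ENTRY = (0, 2)

original = attention_output(PATTERN, VALUES)
single_ablated = attention_output(zero_ablate(PATTERN, [ENTRY]), VALUES)
others_ablated = attention_output(
    zero_ablate(PATTERN, all_other_entries(3, ENTRY)), VALUES
)
```

In this toy example, removing the single entry changes the output at the query square far more than removing all the other entries does, which is the kind of contrast the green and gray lines in the figure are meant to show.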
