
How GPT learns layer by layer

Jason Du, Kelly Hong, Alishba Imran, Erfan Jahanparast, Mehdi Khfifi, Kaichun Qiao

TL;DR

This paper investigates how GPT-style models learn internal world models by analyzing layer-by-layer representations in a small GPT variant trained on Othello. It compares Sparse Autoencoders (SAEs) and linear probes to decode board state, tile color, and tile stability, finding that SAEs uncover more disentangled, compositional features while linear probes capture increasingly predictive signals. The results reveal a hierarchical progression where early layers encode static board geometry and edges, while deeper layers reflect dynamic tile changes and stability, illustrating a mechanistic view of representation learning in transformers. The work provides a practical framework for interpreting internal representations in GPT-like models and offers publicly available code to extend the analysis to larger LLMs.

Abstract

Large Language Models (LLMs) excel at tasks like language processing, strategy games, and reasoning but struggle to build the generalizable internal representations that are essential for adaptive decision-making in agents. For agents to effectively navigate complex environments, they must construct reliable world models. While LLMs perform well on specific benchmarks, they often fail to generalize, leading to brittle representations that limit their real-world effectiveness. Understanding how LLMs build internal world models is key to developing agents capable of consistent, adaptive behavior across tasks. We analyze OthelloGPT, a GPT-based model trained on Othello gameplay, as a controlled testbed for studying representation learning. Despite being trained solely on next-token prediction with random valid moves, OthelloGPT shows meaningful layer-wise progression in understanding board state and gameplay. Early layers capture static attributes like board edges, while deeper layers reflect dynamic tile changes. To interpret these representations, we compare Sparse Autoencoders (SAEs) with linear probes, finding that SAEs offer more robust, disentangled insights into compositional features, whereas linear probes mainly detect features useful for classification. We use SAEs to decode features related to tile color and tile stability, a previously unexamined feature that reflects complex gameplay concepts like board control and long-term planning. We track how linear probe accuracy and tile color representations progress across layers, using both SAEs and linear probes to compare how effectively each captures what the model is learning. Although we begin with a smaller language model, OthelloGPT, this study establishes a framework for understanding the internal representations learned by GPT models, transformers, and LLMs more broadly. Our code is publicly available: https://github.com/ALT-JS/OthelloSAE.
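
To make the SAE side of this comparison concrete, the following is a minimal NumPy sketch of a sparse autoencoder applied to a batch of residual-stream vectors. It is an illustration under stated assumptions, not the authors' implementation: the ReLU encoder, the L1 sparsity penalty, the dimensions, the coefficient `l1_coeff`, and all function and variable names are hypothetical choices, and random vectors stand in for real OthelloGPT activations.

```python
import numpy as np

def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Encode activations into non-negative features, then reconstruct.

    x: (batch, d_model) activations, standing in for a residual stream.
    Returns (features, reconstruction).
    """
    # ReLU encoder: features are non-negative, and the L1 term in the
    # loss pushes most of them to zero.
    features = np.maximum(0.0, (x - b_dec) @ W_enc + b_enc)
    reconstruction = features @ W_dec + b_dec
    return features, reconstruction

def sae_loss(x, features, reconstruction, l1_coeff=1e-3):
    """Mean squared reconstruction error plus an L1 sparsity penalty."""
    recon_err = np.mean(np.sum((x - reconstruction) ** 2, axis=-1))
    sparsity = np.mean(np.sum(np.abs(features), axis=-1))
    return recon_err + l1_coeff * sparsity

# Hypothetical sizes; the real model and SAE dimensions may differ.
rng = np.random.default_rng(0)
batch, d_model, d_hidden = 8, 16, 64

x = rng.normal(size=(batch, d_model))            # placeholder activations
W_enc = rng.normal(size=(d_model, d_hidden)) * 0.1
W_dec = rng.normal(size=(d_hidden, d_model)) * 0.1
b_enc = np.zeros(d_hidden)
b_dec = np.zeros(d_model)

features, reconstruction = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
loss = sae_loss(x, features, reconstruction)
print(features.shape, reconstruction.shape, float(loss))
```

In a setup like this, individual columns of `features` are the candidates one would inspect for board-related meaning such as tile color or stability, whereas a linear probe is instead trained directly against a labeled target.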

Paper Structure

This paper contains 21 sections, 2 equations, 6 figures, 7 tables.

Figures (6)

  • Figure 1: Our work is divided into three parts. The left side of the figure illustrates the architecture of OthelloGPT, designed to predict the next legal move in the game of Othello. The upper-right section shows how the residual stream from OthelloGPT is used as input to an SAE, enabling feature analysis through its sparse representations. The lower-right section presents a cosine similarity analysis between the parameters of individual neurons in the MLP layers of OthelloGPT and the linear probes we trained.
  • Figure 1: Tile Stability activation map (seed 2).
  • Figure 2: Linear probe accuracy for two seeds. The results show that linear probes capture features that are good predictors for classification, with accuracy increasing over layers.
  • Figure 3: SAE tile color activation maps, showing the frequency of tile color activations measured across 10 different seeds, as described in the tile color analysis section.
  • Figure 4: Linear probe tile color activation maps, showing the tile color activations measured across layers, as described in the tile color analysis section.
  • ...and 1 more figure
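
The cosine similarity analysis mentioned in the first Figure 1 caption can be sketched as follows. This is a hedged illustration only: it assumes each MLP neuron and each linear probe can be represented as a vector in the same space, and the array shapes, the planted alignment, and the function name `cosine_similarity_matrix` are hypothetical. Random arrays stand in for the trained weights.

```python
import numpy as np

def cosine_similarity_matrix(neurons, probes, eps=1e-8):
    """Pairwise cosine similarity between neuron and probe vectors.

    neurons: (n_neurons, d_model), one weight vector per MLP neuron.
    probes:  (n_probes, d_model), one direction per linear probe.
    Returns an (n_neurons, n_probes) matrix with values in [-1, 1].
    """
    # Normalize each row to unit length; eps guards against zero vectors.
    n = neurons / (np.linalg.norm(neurons, axis=1, keepdims=True) + eps)
    p = probes / (np.linalg.norm(probes, axis=1, keepdims=True) + eps)
    return n @ p.T

# Placeholder weights with hypothetical shapes.
rng = np.random.default_rng(0)
neurons = rng.normal(size=(32, 16))
probes = rng.normal(size=(4, 16))

# Plant one aligned pair so the sketch has a known answer:
# probe 0 points in the same direction as neuron 5.
probes[0] = 3.0 * neurons[5]

sims = cosine_similarity_matrix(neurons, probes)
best_neuron_per_probe = sims.argmax(axis=0)
print(sims.shape, best_neuron_per_probe[0])
```

Cosine similarity ignores vector magnitude, so the planted probe matches neuron 5 even though it was scaled by 3. That scale invariance is the reason to prefer it over a raw dot product when comparing weights that were trained separately.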