Bigram Subnetworks: Mapping to Next Tokens in Transformer Language Models

Tyler A. Chang, Benjamin K. Bergen

TL;DR

The paper investigates whether a minimal, interpretable circuit underpins next-token prediction in Transformer language models by extracting bigram subnetworks that reproduce $P(w_i \mid w_{i-1})$ using a small subset of parameters. Using continuous sparsification, the authors identify subnetworks of roughly 10M non-embedding parameters (about 0.1–0.2% of the model) that achieve a correlation of $r>0.95$ with bigram predictions in models of up to 1B parameters, with most subnetwork parameters concentrated in the first MLP layer. Detailed analyses show that these subnetworks recreate key residual-stream dynamics of the full models, including the initial transformation from current-token to next-token space, and that they overlap substantially with optimally pruned subnetworks; ablating them causes large drops in performance. This work provides a principled, sparse building block for mechanistic interpretability and suggests a path toward studying more complex circuits by building up from the minimal bigram subnetwork.
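
As a rough illustration of the masking idea behind continuous sparsification, the sketch below gates each frozen weight with a sigmoid of a learnable score and sharpens the gate with a temperature until it is effectively binary. This is only a minimal sketch of the general technique, not the authors' implementation: the names `sigmoid`, `soft_mask`, `hard_mask`, and `masked_weights`, the toy weights and scores, and the temperature values are all assumptions, and the training loop that would learn the scores (together with any sparsity penalty) is omitted.

```python
import math

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def soft_mask(scores, beta):
    """Soft gate sigmoid(beta * score); approaches a 0/1 step as beta grows."""
    return [sigmoid(beta * s) for s in scores]

def hard_mask(scores):
    """Binary mask used once training is done: keep weights with positive score."""
    return [1 if s > 0 else 0 for s in scores]

def masked_weights(weights, scores, beta):
    """Frozen model weights multiplied elementwise by the soft gate."""
    return [w * m for w, m in zip(weights, soft_mask(scores, beta))]

# Hypothetical toy values: three frozen weights and their mask scores.
weights = [0.8, -1.2, 0.5]
scores = [1.5, -0.5, 0.2]

early = masked_weights(weights, scores, beta=1.0)    # soft, every weight partly active
late = masked_weights(weights, scores, beta=200.0)   # nearly binary
final = [w * m for w, m in zip(weights, hard_mask(scores))]
active_fraction = sum(hard_mask(scores)) / len(scores)
```

In this toy setting, `late` is numerically close to `final`, which is the sense in which the soft gate converges to a discrete subnetwork; `active_fraction` corresponds to the proportion of active parameters reported for a subnetwork.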

Abstract

In Transformer language models, activation vectors transform from current token embeddings to next token predictions as they pass through the model. To isolate a minimal form of this transformation, we identify language model subnetworks that make bigram predictions, naive next token predictions based only on the current token. We find that bigram subnetworks can be found in fully trained language models up to 1B parameters, and these subnetworks are critical for model performance even when they consist of less than 0.2% of model parameters. Bigram subnetworks are concentrated in the first Transformer MLP layer, and they overlap significantly with subnetworks trained to optimally prune a given model. Mechanistically, the bigram subnetworks often recreate a pattern from the full models where the first layer induces a sharp change that aligns activations with next token predictions rather than current token representations. Our results demonstrate that bigram subnetworks comprise a minimal subset of parameters that are both necessary and sufficient for basic next token predictions in language models, and they help drive the transformation from current to next token activations in the residual stream. These subnetworks can lay a foundation for studying more complex language model circuits by building up from a minimal circuit.
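
To make the notion of a bigram prediction concrete, the following sketch estimates $P(w_i \mid w_{i-1})$ from raw counts on a toy corpus, converts the probabilities to surprisals, and computes a Pearson correlation between two surprisal sequences. It is a minimal illustration under stated assumptions: the corpus, the function names, the absence of smoothing, and the stand-in `model_surprisals` values are all hypothetical and are not taken from the paper.

```python
import math
from collections import Counter, defaultdict

def bigram_probs(tokens):
    """Estimate P(w_i | w_{i-1}) from raw bigram counts (no smoothing)."""
    pair_counts = defaultdict(Counter)
    for prev, cur in zip(tokens, tokens[1:]):
        pair_counts[prev][cur] += 1
    return {
        prev: {cur: n / sum(nexts.values()) for cur, n in nexts.items()}
        for prev, nexts in pair_counts.items()
    }

def bigram_surprisals(tokens, probs):
    """Surprisal -log2 P(w_i | w_{i-1}) for every token after the first."""
    return [-math.log2(probs[prev][cur]) for prev, cur in zip(tokens, tokens[1:])]

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy corpus, whitespace-tokenized.
tokens = "the cat sat on the mat the cat ran".split()
probs = bigram_probs(tokens)
bigram_s = bigram_surprisals(tokens, probs)

# Stand-in for per-token surprisals from a (sub)network; here just an affine
# transform of the bigram surprisals, so the correlation is 1 by construction.
model_surprisals = [2.0 * s + 1.0 for s in bigram_s]
r = pearson_r(bigram_s, model_surprisals)
```

In an actual evaluation, `model_surprisals` would come from the subnetwork's next-token distribution on held-out text, and `r` would play the role of the bigram surprisal correlation reported in the figures.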

Paper Structure

This paper contains 28 sections, 1 equation, 9 figures, 4 tables.

Figures (9)

  • Figure 1: Left, center: bigram surprisal correlations for subnetworks with different numbers of active parameters (excluding embedding parameters), for different models. Bigram correlations are scaled to the highest bigram correlation for any subnetwork trained for that model. Correlations plateau at roughly 10M active parameters regardless of model size. Right: bigram surprisal correlation $r$ for the highest-correlation subnetwork vs. the full model for each model. GPT-2 small* indicates the GPT-2 small replication from Chang et al. (2024).
  • Figure 2: Left: estimated bigram surprisal correlation for bigram subnetworks with different numbers of active parameters (excluding embedding parameters) for Pythia 1B at different checkpoints (§\ref{sec:persistence}). Center, right: proportions of parameters in the Pythia 1B bigram subnetwork that are in each MLP and attention layer throughout pretraining (§\ref{sec:structure}). Note that the color bar scale is 10$\times$ larger for MLP proportions, as a far greater proportion of bigram subnetwork parameters are in the MLP layers.
  • Figure 3: Median rotation to input (current token) activations and to output (next token) activations at each layer in GPT-2 large and Pythia 160M, for the full model, the bigram subnetwork, and a random subnetwork with the same size and structure as the bigram subnetwork (§\ref{sec:rotations}). In full models and their bigram subnetworks, the first layer induces a notable rotation towards next token representations.
  • Figure 4: Cross-layer covariance similarities for Pythia 1B in the full model, its bigram subnetwork, and a random subnetwork with the same size and structure as the bigram subnetwork (§\ref{sec:covariances}). The bigram subnetwork recreates many of the patterns from the full model, despite consisting of only 0.17% of non-embedding parameters.
  • Figure 5: Left: surprisal correlations between optimal subnetworks and the original model, and between optimal subnetworks and bigram predictions, for different numbers of active parameters in Pythia 1B (§\ref{sec:training-optimal}). Right: language modeling evaluation loss when ablating a random subnetwork with the same size and structure as the bigram subnetwork, the bigram subnetwork itself, or an optimal subnetwork of similar size to the bigram subnetwork (§\ref{sec:ablations}).
  • ...and 4 more figures