GPT-2 Through the Lens of Vector Symbolic Architectures

Johannes Knittel, Tushaar Gangavarapu, Hendrik Strobelt, Hanspeter Pfister

TL;DR

The paper investigates whether GPT-2 operates through vector symbolic architectures (VSAs) by interpreting the residual stream as a repository of nearly orthogonal concept vectors that are bundled and bound across layers. It combines theoretical framing with targeted analyses on GPT-2 small, showing that word embeddings, attention, and MLP projections align with VSA-like operations, and that many neurons can be explained by simple bundles of token vectors. The findings suggest that a substantial portion of the learned weights corresponds to bundling/binding circuits and that boolean-function-like processing can occur in MLPs on these concept vectors. This work advances interpretability by linking transformer mechanics to VSAs, with implications for sparse autoencoder disentanglement and architectural design, while noting the need for broader replication across larger models. Overall, the study provides a concrete mechanism for understanding how high-dimensional, nearly orthogonal concept vectors can be composed and manipulated to generate next-token predictions.

Abstract

Understanding the general principles behind transformer models remains a complex endeavor. Experiments with probing and disentangling features using sparse autoencoders (SAEs) suggest that these models might manage linear features embedded as directions in the residual stream. This paper explores the resemblance between the decoder-only transformer architecture and vector symbolic architectures (VSAs) and presents experiments indicating that GPT-2 uses mechanisms involving nearly orthogonal vector bundling and binding operations similar to VSAs for computation and communication between layers. It further shows that these principles help explain a significant portion of the actual neural weights.
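
To make the terms "bundling" and "binding" concrete, the following is a minimal sketch of one common VSA variant: random bipolar vectors, with elementwise addition as bundling and elementwise multiplication as binding. This choice of variant is an assumption made for illustration; the abstract does not say which formulation the paper uses, and the mechanisms it attributes to GPT-2 may differ. The names `role`, `filler`, and `other` are hypothetical, and the dimension 768 is chosen to match the residual stream width of GPT-2 small.

```python
import math
import random


def random_bipolar(dim, rng):
    """Draw a random vector with entries in {-1, +1}."""
    return [rng.choice((-1, 1)) for _ in range(dim)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def bundle(*vectors):
    """Bundling: elementwise sum; the result stays similar to each input."""
    return [sum(components) for components in zip(*vectors)]


def bind(a, b):
    """Binding: elementwise product; the result is dissimilar to both inputs."""
    return [x * y for x, y in zip(a, b)]


rng = random.Random(0)
dim = 768  # residual stream width of GPT-2 small
role, filler, other = (random_bipolar(dim, rng) for _ in range(3))

bundled = bundle(role, filler)
bound = bind(role, filler)
# For bipolar vectors, binding is its own inverse: binding again with
# `role` recovers `filler` exactly.
recovered = bind(bound, role)

print("role vs filler   :", round(cosine(role, filler), 3))
print("bundle vs role   :", round(cosine(bundled, role), 3))
print("bundle vs other  :", round(cosine(bundled, other), 3))
print("bound vs role    :", round(cosine(bound, role), 3))
print("unbinding exact  :", recovered == filler)
```

In high dimensions, independently drawn vectors are nearly orthogonal, so a bundle can be queried for its members by similarity, while a bound pair looks unrelated to either input until it is unbound.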

Paper Structure

This paper contains 14 sections, 4 figures, 1 table.

Figures (4)

  • Figure 1: For each layer, the outputs of the attention block and the feedforward network (MLP) are added to the residual stream of GPT-2, which could be interpreted as a combination of (un)binding and bundling operations using nearly orthogonal vectors.
  • Figure 2: Circuit discovery using similarities between $W_{out}$ and $W_{in}$ vectors based on greedily obtained concept vectors. The two neurons in the first layer each focus on a set of tokens. Neuron 7-1321 seems to model a union of those sets; its output is similar to the input of the depicted neuron in the last layer and to the unembedding of the token "to". The last neuron has many inbound connections to previous outputs and seems to boost the concept of the "to" vector.
  • Figure 3: The attention matrices are composed of nearly orthogonal vectors. This is an exemplary selection of attention heads and their corresponding matrix multiplication $M M^\intercal$.
  • Figure 4: The output projection matrices of the MLP blocks are composed of nearly orthogonal vectors. This is an exemplary selection of output projection matrices for specific layers and the corresponding matrix multiplication $M M^\intercal$. We only show the top left cutout of the visualization as each layer contains 3072 neurons.
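
The product $M M^\intercal$ shown in Figures 3 and 4 can be read as a Gram matrix: entry $(i, j)$ is the dot product of rows $i$ and $j$ of $M$, so a matrix with nearly orthogonal rows yields a strong diagonal and near-zero off-diagonal entries. The sketch below illustrates this with a random stand-in matrix, not with actual GPT-2 weights. The row normalization, the 32-row size, and the helper names `unit_rows` and `gram` are assumptions made for this illustration.

```python
import math
import random


def unit_rows(n_rows, dim, rng):
    """Build a matrix whose rows are random Gaussian vectors scaled to unit length."""
    rows = []
    for _ in range(n_rows):
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(x * x for x in v))
        rows.append([x / norm for x in v])
    return rows


def gram(m):
    """Compute M M^T: all pairwise dot products between rows of m."""
    return [[sum(a * b for a, b in zip(r1, r2)) for r2 in m] for r1 in m]


rng = random.Random(0)
n_rows, dim = 32, 768  # 768 matches the residual stream width of GPT-2 small
M = unit_rows(n_rows, dim, rng)
G = gram(M)

# Diagonal entries are squared row norms; off-diagonal entries are
# cosine similarities between distinct rows.
max_off_diag = max(
    abs(G[i][j]) for i in range(n_rows) for j in range(n_rows) if i != j
)
print("diagonal entry    :", round(G[0][0], 6))
print("max |off-diagonal|:", round(max_off_diag, 3))
```

Rendering `G` as a heatmap would give a picture of the kind described in the captions: a bright diagonal on an almost uniform dark background. Applying the same check to real weights would require loading the model's matrices, which this sketch deliberately leaves out.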