GPT-2 Through the Lens of Vector Symbolic Architectures
Johannes Knittel, Tushaar Gangavarapu, Hendrik Strobelt, Hanspeter Pfister
TL;DR
The paper investigates whether GPT-2 operates like a vector symbolic architecture (VSA) by interpreting the residual stream as a repository of nearly orthogonal concept vectors that are bundled and bound across layers. It combines theoretical framing with targeted analyses on GPT-2 small, showing that word embeddings, attention, and MLP projections align with VSA-like operations, and that many neurons can be explained by simple bundles of token vectors. The findings suggest that a substantial portion of the learned weights corresponds to bundling and binding circuits, and that MLPs can perform boolean-function-like processing on these concept vectors. This work advances interpretability by linking transformer mechanics to VSAs, with implications for sparse autoencoder disentanglement and architectural design, while noting the need for broader replication on larger models. Overall, the study proposes a concrete mechanism for how high-dimensional, nearly orthogonal concept vectors can be composed and manipulated to generate next-token predictions.
Abstract
Understanding the general principles behind transformer models remains a complex endeavor. Experiments with probing and disentangling features using sparse autoencoders (SAEs) suggest that these models might manage linear features embedded as directions in the residual stream. This paper explores the resemblance between the decoder-only transformer architecture and vector symbolic architectures (VSAs) and presents experiments indicating that GPT-2 uses mechanisms involving nearly orthogonal vector bundling and binding operations, similar to VSAs, for computation and communication between layers. It further shows that these principles help explain a significant portion of the actual neural weights.
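To make the terms "nearly orthogonal", "bundling", and "binding" concrete, the following is a minimal sketch of one textbook VSA variant, not the mechanism the paper attributes to GPT-2. It assumes random bipolar vectors, bundling by elementwise addition, and binding by elementwise multiplication; the abstract does not say which binding operator, if any, GPT-2 approximates. The names `random_bipolar`, `bundle`, `bind`, `cat`, `dog`, `fish`, and `role` are hypothetical. The dimension 768 matches the residual stream width of GPT-2 small.

```python
import math
import random


def random_bipolar(rng, dim):
    """Draw a random vector with entries in {-1, +1} (illustrative concept vector)."""
    return [rng.choice((-1.0, 1.0)) for _ in range(dim)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def bundle(*vectors):
    """Bundling (assumed here: elementwise sum); the result stays similar to each input."""
    return [sum(components) for components in zip(*vectors)]


def bind(a, b):
    """Binding (assumed here: elementwise product); the result is dissimilar to both inputs."""
    return [x * y for x, y in zip(a, b)]


rng = random.Random(0)
dim = 768  # residual stream width of GPT-2 small

cat, dog, fish = (random_bipolar(rng, dim) for _ in range(3))
role = random_bipolar(rng, dim)

# Independent random high-dimensional vectors are nearly orthogonal.
near_orthogonal = cosine(cat, dog)

# A bundle remains similar to its members and dissimilar to non-members.
bundled = bundle(cat, dog)
sim_member = cosine(bundled, cat)
sim_nonmember = cosine(bundled, fish)

# A bound pair looks unrelated to its parts ...
bound = bind(role, cat)
sim_bound = cosine(bound, cat)

# ... but binding with the same role again recovers the filler,
# because every entry of a bipolar vector squares to 1.
recovered = bind(role, bound)
sim_recovered = cosine(recovered, cat)

print(f"cat vs dog:          {near_orthogonal:+.3f}")
print(f"bundle vs member:    {sim_member:+.3f}")
print(f"bundle vs other:     {sim_nonmember:+.3f}")
print(f"bound vs filler:     {sim_bound:+.3f}")
print(f"unbound vs filler:   {sim_recovered:+.3f}")
```

In this toy setting a bundle of two vectors has cosine similarity of roughly 0.7 with each member, while unrelated vectors stay close to 0. That gap is what lets a dot product against a concept vector act as a membership test on a superposed representation.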
