A Primer on the Inner Workings of Transformer-based Language Models
Javier Ferrando, Gabriele Sarti, Arianna Bisazza, Marta R. Costa-jussà
TL;DR
This primer surveys interpretability methods for decoder-only Transformer language models along two primary axes: localization (attributing predictions to inputs or model components) and information decoding (understanding what representations encode). It unifies notation and presents decomposition-based views of the forward pass, causal interventions, and circuit discovery to show how attention heads, feed-forward networks (FFNs), and the residual stream contribute to predictions. It catalogs a wide range of techniques, from gradient-based and perturbation-based attribution to subspace patching, sparse autoencoders (SAEs), probing, and logit-space analyses, and synthesizes many discovered internal behaviors, such as induction, copying, attention sinks, and multi-component grokking circuits. The work also highlights practical interpretability tools and cautions about the limits of faithfulness and generalization, offering directions for more actionable, safe, and human-centric explanations in real-world settings.
Abstract
The rapid progress of research aimed at interpreting the inner workings of advanced language models has highlighted a need for contextualizing the insights gained from years of work in this area. This primer provides a concise technical introduction to the current techniques used to interpret the inner workings of Transformer-based language models, focusing on the generative decoder-only architecture. We conclude by presenting a comprehensive overview of the known internal mechanisms implemented by these models, uncovering connections across popular approaches and active research directions in this area.
