A Primer on the Inner Workings of Transformer-based Language Models

Javier Ferrando, Gabriele Sarti, Arianna Bisazza, Marta R. Costa-jussà

TL;DR

This primer surveys interpretability methods for decoder-only Transformer language models, organized along two axes: localization (attributing predictions to inputs or model components) and information decoding (understanding what representations encode). It unifies notation and presents decomposition-based views of the forward pass, causal interventions, and circuit discovery to show how attention heads, FFNs, and the residual stream contribute to predictions. It catalogs a wide range of techniques, from gradient-based and perturbation-based attribution to subspace patching, SAEs, probing, and logit-space analyses, and summarizes known internal behaviors such as induction, copying, attention sinks, and the multi-component circuits associated with grokking. The work also highlights practical interpretability tools, cautions about the limits of faithfulness and generalization, and suggests directions toward more actionable, safe, and human-centric explanations in real-world settings.

Abstract

The rapid progress of research aimed at interpreting the inner workings of advanced language models has highlighted a need for contextualizing the insights gained from years of work in this area. This primer provides a concise technical introduction to the current techniques used to interpret the inner workings of Transformer-based language models, focusing on the generative decoder-only architecture. We conclude by presenting a comprehensive overview of the known internal mechanisms implemented by these models, uncovering connections across popular approaches and active research directions in this area.

Paper Structure (87 sections, 28 equations, 16 figures, 1 table)

Figures (16)

  • Figure 1: Survey overview. \ref{sec:components_transformer_lm} introduces the Transformer language model and its components. \ref{sec:behavior_localization} and \ref{sec:information_decoding} present interpretability techniques used to analyze models' inner workings. Finally, \ref{sec:what_we_know_transformer} presents known inner workings of Transformer language models.
  • Figure 2: Unrolled Transformer LM with expanded views of the Attention and Feedforward network blocks, including model weights (gray) and residual stream states (green). Based on figures from ferrando2024information and voita2023neurons.
  • Figure 3: Forward pass decomposition in a simplified Transformer LM. The direct path (red), full OV circuits (yellow), and virtual attention heads (gray) expressed in \ref{eq:transformer_paths} are highlighted.
  • Figure 4: Three approaches to compute inter-token contributions ($c_{i,j}$) towards context mixing in attention heads. Relying only on attention weights overlooks the magnitude of the vectors they operate on. This limitation can be addressed by accounting for the norm of the value-weighted or output-value-weighted vectors (${\bm{x}}_j'$). Finally, distance-based analysis estimates the contribution of weighted vectors from their proximity to the attention output.
  • Figure 5: Direct Logit Attributions (DLA) on output token $w$. (a) DLA of an attention head $\text{Attn}^{l,h}$, (b) DLA of an intermediate representation ${\bm{x}}_{1}^{l-1}$ via an attention head, (c) DLA of an FFN block, and (d) DLA of a single neuron.
  • ...and 11 more figures