Singular Vectors of Attention Heads Align with Features

Gabriel Franco, Carson Loughridge, Mark Crovella

TL;DR

The paper investigates why and when singular vectors of attention heads align with features in language models. Using a tractable toy model (an autoencoder plus an attention head), it proves that alignment between singular vectors and features (SVF alignment) holds exactly or approximately under isotropic or near-isotropic feature distributions, and shows that non-target features orthogonalize against the features of interest to minimize interference. It proposes sparse attention decomposition (SAD) as a testable prediction of alignment and demonstrates SAD emerging in both toy models and real models (GPT-2 and Pythia): when features are present, attention logits decompose sparsely in the SVD basis. These results suggest that SVF alignment can be a sound, theoretically justified basis for identifying feature representations in transformers, with practical implications for mechanistic interpretability and causal analysis.

Abstract

Identifying feature representations in language models is a central task in mechanistic interpretability. Several recent studies have made an implicit assumption that feature representations can be inferred in some cases from singular vectors of attention matrices. However, sound justification for this assumption is lacking. In this paper we address that question, asking: why and when do singular vectors align with features? First, we demonstrate that singular vectors robustly align with features in a model where features can be directly observed. We then show theoretically that such alignment is expected under a range of conditions. We close by asking how, operationally, alignment may be recognized in real models where feature representations are not directly observable. We identify sparse attention decomposition as a testable prediction of alignment, and show evidence that it emerges in a manner consistent with predictions in real models. Together these results suggest that alignment of singular vectors with features can be a sound and theoretically justified basis for feature identification in language models.
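
To make the idea of sparse attention decomposition concrete, the following is a minimal NumPy sketch, not the authors' code. It assumes a hand-built toy matrix `W_qk` in which one hypothetical feature pair (`w0` on the destination side, `w1` on the source side) interacts strongly, plus small noise; the function name `attention_logit_decomposition` and all values are illustrative assumptions. The sketch splits the bilinear logit into one term per singular vector pair and checks how concentrated those terms are.

```python
import numpy as np

def attention_logit_decomposition(W_qk, x_dst, x_src):
    """Split the bilinear logit x_dst^T W_qk x_src into one term per singular vector pair."""
    U, S, Vt = np.linalg.svd(W_qk)
    # Term k is (x_dst . u_k) * sigma_k * (v_k . x_src); the terms sum to the logit.
    return (x_dst @ U) * S * (Vt @ x_src)

rng = np.random.default_rng(0)
d = 16

# Hypothetical orthonormal feature directions (columns of Q); an assumption for this toy.
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))
w0, w1 = Q[:, 0], Q[:, 1]

# Toy attention matrix: strong w0 -> w1 interaction plus small random noise.
W_qk = 5.0 * np.outer(w0, w1) + 0.01 * rng.normal(size=(d, d))

# Token representations that carry exactly the feature pair of interest.
terms = attention_logit_decomposition(W_qk, x_dst=w0, x_src=w1)
logit = w0 @ W_qk @ w1

# Fraction of total absolute contribution carried by each singular vector pair.
share = np.abs(terms) / np.abs(terms).sum()
print("logit:", logit)
print("share of top singular pair:", share[0])
```

In this constructed case nearly all of the logit is carried by the first singular vector pair, which is the kind of sparsity the abstract describes as a testable prediction. In a real model the features are not observable, so only the sparsity of `terms` itself could be inspected.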

Paper Structure (44 sections, 7 theorems, 64 equations, 20 figures, 2 tables)

Key Result

Lemma 1

Let $a,b\in\mathbb{R}^m$. Then where $\mathbf{1}\in\mathbb{R}^m$ is the all-ones vector.

Figures (20)

  • Figure 1: The geometry of features as illustrated via cosine similarities. (a) Without the attention head, features arrange isotropically. (b) With 20 features of which $w_0, w_1$ are of interest, features of interest orthogonalize against the others. (c) With 100 features in dimension 50, of which 40 (20 pairs) are of interest, features of interest also orthogonalize against the others.
  • Figure 2: Singular vectors align with features (shown by green boxes). Cosine similarities of singular vectors and features, and magnitudes of singular values. (a) 20 features of which $w_0, w_1$ are of interest; $w_0$ aligns only with $u_0$ and $w_1$ aligns only with $v_0$. (b) 100 features of which $w_0 \dots w_{39}$ are of interest. For clarity, only an initial subset of features is shown; full figures are in Appendix \ref{app:model}.
  • Figure 3: Both singular vectors and features evolve during training, and alignment occurs for highest-logit features first. Above: Cosine similarities showing evolution of singular vectors (top) and features (bottom). Below: Cosine similarities showing evolution of alignment of singular vectors with features.
  • Figure 4: Relative attention decomposition is sparse when a single feature pair is present. Top: Early in training; Bottom: Late in training.
  • Figure 5: Sparse attention decomposition identifies feature presence. Top: Decomposition of relative attention across all 10 singular vectors for five token pairs $(r, s)$. Bottom: Feature strength ($f^{(r)}_i f^{(s)}_{i+4}$) for the four corresponding feature pairs $w_i, w_{i+4}$.
  • ...and 15 more figures
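
The alignment measurements in the Figure 2 caption (cosine similarities between features and singular vectors) can be illustrated with a small sketch. This is an idealized construction under stated assumptions, not the paper's trained toy model: the feature matrix `F`, the coupling strengths 4.0 and 2.0, and the choice of feature pairs are all hypothetical, and the features are made exactly orthonormal so that alignment is exact.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_feat = 12, 6

# Hypothetical feature directions: orthonormal columns of F (an assumption).
F, _ = np.linalg.qr(rng.normal(size=(d, n_feat)))

# Toy attention matrix coupling the pairs (w0, w1) and (w2, w3) with distinct strengths.
W = 4.0 * np.outer(F[:, 0], F[:, 1]) + 2.0 * np.outer(F[:, 2], F[:, 3])

U, S, Vt = np.linalg.svd(W)

# All vectors have unit norm, so dot products are cosine similarities.
# Absolute values are taken because singular vectors are defined only up to sign.
cos_U = np.abs(F.T @ U[:, :2])   # features vs. top-2 left singular vectors
cos_V = np.abs(F.T @ Vt[:2].T)   # features vs. top-2 right singular vectors

print(np.round(S[:3], 3))
print(np.round(cos_U, 3))
print(np.round(cos_V, 3))
```

Here each feature of interest matches exactly one singular vector, and the remaining features have near-zero similarity with the top singular vectors. The distinct strengths matter: with equal singular values the singular vectors would be determined only up to rotation within the shared subspace.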

Theorems & Definitions (14)

  • Lemma 1
  • proof
  • Lemma 2
  • proof
  • Lemma 3
  • proof
  • Theorem 1
  • proof
  • Theorem 2
  • proof
  • ...and 4 more