Decomposing Query-Key Feature Interactions Using Contrastive Covariances

Andrew Lee, Yonatan Belinkov, Fernanda Viégas, Martin Wattenberg

TL;DR

The paper tackles the challenge of interpreting Transformer attention by treating the query-key (QK) space as a bilinear interaction and introducing a contrastive covariance framework to extract interpretable, low-rank feature subspaces. By constructing positive and negative covariances that isolate a given latent feature, the method recovers the feature's rank and its subspaces in both the query and key spaces via SVD, enabling causal interventions and logit-level attribution. The approach is validated analytically on a toy payload-retrieval model and empirically on large language models (e.g., Llama 3.1-8B Instruct and Qwen 3-4B Instruct), where it reveals categorical semantic subspaces in Filter Heads and binding features (order-ID and lexical), supported by visualizations and logit decompositions. This yields interpretable, testable insights into how attention arises from specific QK features, offering a path toward more transparent and controllable Transformer behavior, including feature-level logit attributions and potential safety benefits. QK-space decomposition thus provides a practical, causal, and interpretable lens on attention mechanisms, with implications for model diagnostics and design.

Abstract

Despite the central role of attention heads in Transformers, we lack tools to understand why a model attends to a particular token. To address this, we study the query-key (QK) space -- the bilinear joint embedding space between queries and keys. We present a contrastive covariance method to decompose the QK space into low-rank, human-interpretable components. It is when features in keys and queries align in these low-rank subspaces that high attention scores are produced. We first study our method both analytically and empirically in a simplified setting. We then apply our method to large language models to identify human-interpretable QK subspaces for categorical semantic features and binding features. Finally, we demonstrate how attention scores can be attributed to our identified features.
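
The contrastive covariance idea summarized above can be illustrated with a minimal NumPy sketch. Everything in it is an assumption made for illustration, not the authors' implementation: the toy data generator, the orthonormal embedding maps `A_q` and `A_k`, the noise level, the use of a plain cross-covariance for the positive and negative terms, and the 10% singular-value threshold used to read off the rank. Positive pairs share a latent feature between query and key, negative pairs do not, and the SVD of the difference exposes a low-rank subspace on each side.

```python
import numpy as np

rng = np.random.default_rng(0)
d_head, rank, n = 16, 3, 4000  # hypothetical sizes, chosen for illustration

# Hypothetical toy setup: a latent feature z in {-1, +1}^rank is written into
# queries and keys through fixed orthonormal maps A_q and A_k, plus noise.
A_q, _ = np.linalg.qr(rng.normal(size=(d_head, rank)))
A_k, _ = np.linalg.qr(rng.normal(size=(d_head, rank)))


def embed(z, A, noise=0.1):
    """Map latent features (n, rank) to head-space vectors (n, d_head)."""
    return z @ A.T + noise * rng.normal(size=(z.shape[0], A.shape[0]))


z_query = rng.choice([-1.0, 1.0], size=(n, rank))
z_other = rng.choice([-1.0, 1.0], size=(n, rank))  # independent of z_query

Q = embed(z_query, A_q)
K_pos = embed(z_query, A_k)  # positive pairs: key carries the query's feature
K_neg = embed(z_other, A_k)  # negative pairs: key carries an unrelated feature


def cross_cov(X, Y):
    """Sample cross-covariance between rows of X and Y, shape (d_head, d_head)."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    return Xc.T @ Yc / (len(X) - 1)


# Contrastive covariance: what co-varies only when the feature is shared.
delta_C = cross_cov(Q, K_pos) - cross_cov(Q, K_neg)

# SVD gives a query-side basis (columns of U) and a key-side basis (rows of Vt).
U, S, Vt = np.linalg.svd(delta_C)

# Assumed heuristic: count singular values above 10% of the largest one.
recovered_rank = int((S > 0.1 * S[0]).sum())
query_basis = U[:, :recovered_rank]
key_basis = Vt[:recovered_rank].T

print("recovered rank:", recovered_rank)
```

In this sketch the query-side and key-side bases can be compared against `A_q` and `A_k` (for example, through the singular values of `A_q.T @ query_basis`) to check that the recovered subspaces match the planted ones. On a real model, `Q`, `K_pos`, and `K_neg` would instead come from a head's query and key activations on prompts that do or do not share the feature of interest.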

Paper Structure

This paper contains 26 sections, 26 equations, 20 figures, and 1 table.

Figures (20)

  • Figure 1: Contrastive covariance method schema. We define positive and negative covariance terms between queries and keys, each capturing the presence (or absence) of a feature. The resulting contrastive covariance term isolates the feature in QK space.
  • Figure 2: Contrastive QK decomposition recovers the ground-truth rank of each latent variable, as long as there is no superposition (i.e., $r_1 + r_2 < d_\text{head}$). Each cell annotates the recovered ranks $r_1, r_2$, while the x- and y-axes indicate the ground-truth ranks. The color of each cell indicates the difference between ground-truth and recovered ranks.
  • Figure 3: PCA of Latent Variable Subspace. We project key and query vectors onto the recovered subspaces of latent variable $\mathbf{z}_1$ (of rank $r_1 = 3$), then perform PCA, which recovers the 3D-cube structure of $\mathbf{z}_1$. Also note that keys and queries align onto the same clusters. See Figure \ref{fig:toy_model_3d_pca_continuous} for the continuous task variant, in which our method recovers the spherical structure of latent variable $\mathbf{s}_1$.
  • Figure 4: Causal Interventions on Latent Variable Subspaces. Intervening on the recovered subspaces for latent variables $\mathbf{z}_1$ and $\mathbf{z}_2$ shifts all the attention from the original token to the target token, while intervening on random subspaces of the same dimension (i.e., "Rand $r_1, r_2, r_1 + r_2$") has less of an effect.
  • Figure 5: Interactions between latent variables in QK space reveal feature splits and superposition. When the model has enough dimensions ($r_1 + r_2 \leq d_\text{head}$), the model further decomposes the latent variables into independent components (feature splits: strong diagonals in $\mathbf{G}$, as opposed to block diagonals). When there are not enough dimensions ($r_1 + r_2 > d_\text{head}$), we observe superposition, in which the model compresses both latent variables into fewer dimensions than available (off-diagonal interactions in $\mathbf{G}$).
  • ...and 15 more figures