Decomposing Query-Key Feature Interactions Using Contrastive Covariances
Andrew Lee, Yonatan Belinkov, Fernanda Viégas, Martin Wattenberg
TL;DR
The paper addresses the problem of interpreting Transformer attention by treating the query-key (QK) space as a bilinear interaction and introducing a contrastive covariance framework that extracts interpretable, low-rank feature subspaces. The method constructs positive and negative covariances that isolate a given latent feature, then applies SVD to recover that feature's rank and its subspaces in both the query and key spaces, which enables causal interventions and logit-level attribution. The approach is validated analytically on a toy payload-retrieval model and empirically on large language models (e.g., Llama 3.1-8B Instruct and Qwen 3-4B Instruct), where it reveals categorical semantic subspaces in Filter Heads and binding features (order-ID and lexical), supported by visualizations and logit decompositions. The result is a set of interpretable, testable insights into how attention arises from specific QK features, including feature-level logit attributions, with potential benefits for safety, model diagnostics, and design.
Abstract
Despite the central role of attention heads in Transformers, we lack tools to understand why a model attends to a particular token. To address this, we study the query-key (QK) space -- the bilinear joint embedding space between queries and keys. We present a contrastive covariance method to decompose the QK space into low-rank, human-interpretable components. It is when features in keys and queries align in these low-rank subspaces that high attention scores are produced. We first study our method both analytically and empirically in a simplified setting. We then apply our method to large language models to identify human-interpretable QK subspaces for categorical semantic features and binding features. Finally, we demonstrate how attention scores can be attributed to our identified features.
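The contrastive covariance idea described above can be illustrated with a small synthetic sketch. This is not the authors' implementation. It assumes that the positive and negative covariances are query-key cross-covariances computed over pairs that do and do not share the target feature, and that the feature subspaces are read off the SVD of their difference. The helper names (`contrastive_qk_subspace`, `orthonormal`), the synthetic data model, and the relative threshold `rel_tol` used to pick the rank are all hypothetical choices made for illustration.

```python
import numpy as np

def contrastive_qk_subspace(q, k_pos, k_neg, rel_tol=0.1):
    """Hypothetical helper: SVD of the positive-minus-negative QK covariance.

    q, k_pos, k_neg have shape (n, d); row i of k_pos shares the target
    feature with row i of q, while row i of k_neg does not.
    """
    n = q.shape[0]
    # Uncentered cross-covariances; the synthetic data below is zero-mean.
    c_pos = q.T @ k_pos / n
    c_neg = q.T @ k_neg / n
    # Structure common to both pairings cancels in the difference.
    delta = c_pos - c_neg
    u, s, vt = np.linalg.svd(delta, full_matrices=False)
    # Assumed rank rule: keep singular values above a fraction of the largest.
    rank = int(np.sum(s > rel_tol * s[0]))
    return u[:, :rank], s, vt[:rank].T

def orthonormal(rng, rows, cols):
    """Random matrix with orthonormal columns (illustrative embedding)."""
    return np.linalg.qr(rng.normal(size=(rows, cols)))[0]

rng = np.random.default_rng(0)
n, d, r, r_shared = 20000, 16, 2, 3

# A and B embed the target feature in query and key space; C and D embed a
# nuisance feature that is present in both positive and negative pairs.
A, B = orthonormal(rng, d, r), orthonormal(rng, d, r)
C, D = orthonormal(rng, d, r_shared), orthonormal(rng, d, r_shared)

z = rng.normal(size=(n, r))                # target feature of each query
z_other = rng.normal(size=(n, r))          # unrelated feature for negatives
s_shared = rng.normal(size=(n, r_shared))  # nuisance feature shared by all

def noise():
    return 0.1 * rng.normal(size=(n, d))

q = z @ A.T + s_shared @ C.T + noise()
k_pos = z @ B.T + s_shared @ D.T + noise()
k_neg = z_other @ B.T + s_shared @ D.T + noise()

U_q, sing_vals, V_k = contrastive_qk_subspace(q, k_pos, k_neg)

# Cosines of the principal angles between recovered and true subspaces.
query_alignment = np.linalg.svd(U_q.T @ A, compute_uv=False)
key_alignment = np.linalg.svd(V_k.T @ B, compute_uv=False)
print("estimated rank:", U_q.shape[1])
print("query alignment:", query_alignment)
print("key alignment:", key_alignment)
```

In this toy setting the nuisance feature contributes the same term to both covariances, so it drops out of the difference and only the target feature's low-rank structure remains. On real attention heads, how positive and negative pairs are chosen determines which feature is isolated, and that choice is not modeled here.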
