Dimensional Collapse in Transformer Attention Outputs: A Challenge for Sparse Dictionary Learning

Junxuan Wang, Xuyang Ge, Wentao Shu, Zhengfu He, Xipeng Qiu

TL;DR

The paper shows that transformer attention outputs occupy a low-dimensional subspace, with an effective dimensionality of roughly $60\%$ of the full hidden dimension, whereas MLP outputs and residual streams stay much closer to full rank (about $90\%$). This low-rank structure is strongly shaped by the attention output projection $W^O$ and is linked to the dead-feature problem in sparse dictionary learning. The paper then introduces Active Subspace Initialization (ASI), a subspace-aware way to initialize sparse autoencoders (SAEs) and related sparse replacement models. ASI reduces dead features from about $87\%$ to under $1\%$ in Attention Output SAEs with 1M features, improves reconstruction, and carries over to other sparse dictionary methods such as Lorsa. The work thus offers both a mechanistic view of attention geometry and a practical tool for scaling sparse dictionary learning in large language models.

Abstract

Transformer architectures, and their attention mechanisms in particular, form the foundation of modern large language models. While transformer models are widely believed to operate in high-dimensional hidden spaces, we show that attention outputs are in fact confined to a surprisingly low-dimensional subspace, with an effective dimensionality of only about $60\%$ of the full space. In contrast, MLP outputs and residual streams remain much closer to full-rank, exhibiting effective ranks around $90\%$. This striking dimensional discrepancy is consistently observed across diverse model families and datasets, and is strongly shaped by the attention output projection matrix. Critically, we identify this low-rank structure as a key factor behind the prevalent dead feature problem in sparse dictionary learning, where it creates a mismatch between randomly initialized features and the intrinsic geometry of the activation space. Building on this insight, we propose a subspace-constrained training method for sparse autoencoders (SAEs), initializing feature directions into the active subspace of activations. Our approach reduces dead features from $87\%$ to below $1\%$ in Attention Output SAEs with 1M features, and can further extend to other sparse dictionary learning methods. Our findings provide both new insights into the geometry of attention and practical tools for improving sparse dictionary learning in large language models.
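
The subspace-constrained initialization described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the helper names `active_subspace_init` and `energy_in_subspace`, the mean-centering step, the Gaussian coefficients, and the choice of `subspace_dim` are all assumptions made for illustration, and the activations are synthetic. The sketch takes the top right singular vectors of a batch of activations as the active subspace, draws unit-norm feature directions inside it, and compares them with ordinary random directions.

```python
import numpy as np


def active_subspace_init(acts, n_features, subspace_dim, seed=0):
    """Hypothetical helper: draw unit-norm feature directions inside the
    span of the top `subspace_dim` right singular vectors of `acts`."""
    rng = np.random.default_rng(seed)
    # Assumption: center the activations before taking the SVD.
    centered = acts - acts.mean(axis=0, keepdims=True)
    # Rows of vt are orthonormal directions in activation space,
    # ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:subspace_dim]                      # (subspace_dim, d_model)
    coeffs = rng.standard_normal((n_features, subspace_dim))
    directions = coeffs @ basis                    # (n_features, d_model)
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    return directions, basis


def energy_in_subspace(directions, basis):
    """Mean fraction of each direction's squared norm lying in span(basis)."""
    projected = directions @ basis.T               # coordinates in the subspace
    inside = (projected ** 2).sum(axis=1)
    total = (directions ** 2).sum(axis=1)
    return float(np.mean(inside / total))


# Synthetic stand-in for attention outputs: 32-dim vectors of true rank 8.
rng = np.random.default_rng(0)
d_model, true_rank = 32, 8
acts = rng.standard_normal((512, true_rank)) @ rng.standard_normal((true_rank, d_model))

asi_dirs, basis = active_subspace_init(acts, n_features=100, subspace_dim=true_rank)
random_dirs = rng.standard_normal((100, d_model))  # ordinary isotropic init

asi_energy = energy_in_subspace(asi_dirs, basis)
random_energy = energy_in_subspace(random_dirs, basis)
```

On this toy data the subspace-initialized directions lie inside the active subspace by construction, while isotropic random directions place only a quarter of their squared norm there in expectation (8 of 32 dimensions). That gap is the kind of mismatch between random initialization and activation geometry that the abstract points to.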

Paper Structure

This paper contains 96 sections, 17 equations, 16 figures, 10 tables.

Figures (16)

  • Figure 1: (left) Attention outputs exhibit pronounced low-rank structure compared to residual streams and MLP outputs, indicating that the attention layer writes into a subspace of the residual stream. (right) Low effective dimensionality of activations is a root cause of dead features in sparse dictionary learning methods. Setting feature directions in the active subspace mitigates this issue.
  • Figure 2: Across layers, model families, and datasets, attention outputs exhibit dramatically lower effective rank than residual streams and MLP outputs, indicating that attention layers writing into a low-dimensional subspace of the residual stream is a universal phenomenon. Details in Section \ref{sec:activation_spectra:svd}. (left) Evaluation of Llama-3.1-8B on the SlimPajama dataset [cerebras2023slimpajama]. (mid) Middle-layer analysis across model families on the SlimPajama dataset. (right) Middle-layer analysis of Llama-3.1-8B across datasets.
  • Figure 3: (a) The attention output is the most low-rank, as indicated by the sharpest decay in singular values. (b) Fraction of loss recovered using varying numbers of top singular components.
  • Figure 4: Decomposition of the singular value spectrum of the attention output $O$. We analyze the contributions of the concatenated head outputs $Z$ and the projection matrix $W^O$ to the singular values of $O$ ($O = Z W^O$). For each component, the red value is the product of the purple and blue values. The curve of $O$ closely follows that of $Z$ for the top components, whereas its downward trend at the tail is mainly due to the contribution of $W^O$.
  • Figure 5: The number of dead features (left) and the effective rank (mid) of each activation in Llama-3.1-8B show a striking consistency (right): activations with lower effective rank have more dead features. These activations are the attention outputs of all layers and the MLP outputs of the last two layers.
  • ...and 11 more figures

Theorems & Definitions (1)

  • Definition 4.1: Effective Rank [roy2007erank]
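
For readers unfamiliar with the quantity named in Definition 4.1, the effective rank in the sense of the cited work is the exponential of the Shannon entropy of the normalized singular values: $\mathrm{erank}(A) = \exp\left(-\sum_k p_k \log p_k\right)$ with $p_k = \sigma_k / \sum_j \sigma_j$. The sketch below computes it with NumPy on synthetic data. The function name `effective_rank`, the `eps` guard, and the toy matrices are illustrative assumptions; whether the paper centers or otherwise preprocesses activations before computing this quantity is not stated in this summary.

```python
import numpy as np


def effective_rank(acts: np.ndarray, eps: float = 1e-12) -> float:
    """Effective rank: exp of the Shannon entropy of the singular values
    after normalizing them to sum to one."""
    s = np.linalg.svd(acts, compute_uv=False)
    p = s / (s.sum() + eps)
    p = p[p > eps]                  # drop numerically zero entries (0 * log 0 = 0)
    return float(np.exp(-(p * np.log(p)).sum()))


# Toy comparison (synthetic, not model activations):
rng = np.random.default_rng(0)
n, d, k = 2048, 64, 16
full = rng.standard_normal((n, d))                               # generic full-rank matrix
low = rng.standard_normal((n, k)) @ rng.standard_normal((k, d))  # rank at most k

erank_full = effective_rank(full)
erank_low = effective_rank(low)
erank_identity = effective_rank(np.eye(8))   # equal singular values
```

The effective rank never exceeds the ordinary rank, and it equals the ordinary rank only when all nonzero singular values are equal, as for the identity matrix. Dividing it by the hidden dimension gives a fraction comparable to the percentages quoted in the abstract, assuming that is the normalization the paper uses.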