Dimensional Collapse in Transformer Attention Outputs: A Challenge for Sparse Dictionary Learning
Junxuan Wang, Xuyang Ge, Wentao Shu, Zhengfu He, Xipeng Qiu
TL;DR
The paper shows that transformer attention outputs occupy a low-dimensional subspace, with an effective dimensionality of about $60\%$ of the full hidden dimension, whereas MLP outputs and residual streams have effective ranks around $90\%$. This low-rank structure is strongly shaped by the output projection $W^O$ and is a key factor in the dead-feature problem in sparse dictionary learning. The paper then introduces Active Subspace Initialization (ASI), which initializes the feature directions of sparse autoencoders (SAEs) within the active subspace of the activations. ASI reduces dead features from $87\%$ to below $1\%$ in Attention Output SAEs with 1M features, improves reconstruction, and extends to other sparse dictionary learning methods such as Lorsa. The work provides both a mechanistic view of attention geometry and a practical tool for scaling sparse dictionary learning in large language models.
Abstract
Transformer architectures, and their attention mechanisms in particular, form the foundation of modern large language models. While transformer models are widely believed to operate in high-dimensional hidden spaces, we show that attention outputs are in fact confined to a surprisingly low-dimensional subspace, with an effective dimensionality of only about $60\%$ of the full space. In contrast, MLP outputs and residual streams remain much closer to full rank, exhibiting effective ranks around $90\%$. This dimensional discrepancy is consistently observed across diverse model families and datasets, and is strongly shaped by the attention output projection matrix. Critically, we identify this low-rank structure as a key contributor to the prevalent dead feature problem in sparse dictionary learning, where it creates a mismatch between randomly initialized features and the intrinsic geometry of the activation space. Building on this insight, we propose a subspace-constrained training method for sparse autoencoders (SAEs) that initializes feature directions within the active subspace of the activations. Our approach reduces dead features from $87\%$ to below $1\%$ in Attention Output SAEs with 1M features, and extends to other sparse dictionary learning methods. Our findings provide both new insights into the geometry of attention and practical tools for improving sparse dictionary learning in large language models.
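The two ideas in the abstract, measuring how many dimensions the activations effectively use and initializing feature directions inside that active subspace, can be illustrated with a minimal NumPy sketch. Everything in it is an assumption made for illustration: the entropy-based effective rank is one common definition and may differ from the measure used in the paper, the function names `effective_rank`, `active_subspace`, and `init_decoder_in_subspace` are hypothetical, and the activations are synthetic rather than taken from a real model.

```python
import numpy as np

def effective_rank(acts):
    """Entropy-based effective rank of centered activations, shape (n, d).

    Assumption: this is one common definition, not necessarily the paper's.
    """
    centered = acts - acts.mean(axis=0, keepdims=True)
    s = np.linalg.svd(centered, compute_uv=False)
    p = s / s.sum()          # normalize singular values to a distribution
    p = p[p > 0]             # drop exact zeros before taking the log
    return float(np.exp(-(p * np.log(p)).sum()))

def active_subspace(acts, k):
    """Orthonormal basis, shape (d, k), of the top-k principal directions."""
    centered = acts - acts.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:k].T

def init_decoder_in_subspace(basis, n_features, rng):
    """Random unit-norm feature directions lying in span(basis)."""
    coeffs = rng.standard_normal((n_features, basis.shape[1]))
    w = coeffs @ basis.T     # map subspace coordinates back to d dimensions
    return w / np.linalg.norm(w, axis=1, keepdims=True)

rng = np.random.default_rng(0)
d, k, n = 64, 16, 2048       # hypothetical sizes for the toy example

# Synthetic stand-in for attention outputs: rank-k data in a d-dim space.
mixing = rng.standard_normal((k, d))
acts = rng.standard_normal((n, k)) @ mixing

basis = active_subspace(acts, k)
w_dec = init_decoder_in_subspace(basis, 256, rng)

# Part of each feature direction that falls outside the active subspace.
residual = w_dec - (w_dec @ basis) @ basis.T

print("effective rank:", effective_rank(acts), "of", d)
print("max out-of-subspace component:", np.abs(residual).max())
```

In this toy setting a plain random initialization in all $d$ dimensions would place most of each feature direction outside the subspace the data occupies, which is the mismatch the abstract describes; restricting the initialization to the top principal directions avoids it by construction.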
