Spectral Filters, Dark Signals, and Attention Sinks
Nicola Cancedda
TL;DR
This work extends the logit lens with logit spectroscopy: it defines spectral filters over the singular vectors of the embedding and unembedding matrices in order to probe dark subspaces in large language models. It shows that signals in the tail of the spectrum govern attention sinking, and that substantial portions of the spectrum can be suppressed without harming next-token prediction, provided the sinking pathways are preserved. The work introduces sink-preserving filters (Ω) and demonstrates robust generation and sink dynamics across LLaMa2 models, linking dark signals to attention sinks such as the BoS token. The findings suggest avenues for spectral compression of the residual stream and motivate finer-grained analysis of dark subspaces and attention bars, with implications for model transparency and safety.
Abstract
Projecting intermediate representations onto the vocabulary is an increasingly popular interpretation tool for transformer-based LLMs, also known as the logit lens. We propose a quantitative extension to this approach and define spectral filters on intermediate representations based on partitioning the singular vectors of the vocabulary embedding and unembedding matrices into bands. We find that the signals exchanged in the tail end of the spectrum are responsible for attention sinking (Xiao et al. 2023), for which we provide an explanation. We find that the loss of pretrained models can be kept low despite suppressing sizable parts of the embedding spectrum in a layer-dependent way, as long as attention sinking is preserved. Finally, we discover that the representations of tokens that draw attention from many tokens have large projections on the tail end of the spectrum.
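
To make the idea of a spectral filter concrete, here is a minimal sketch and not the paper's implementation. It assumes an unembedding matrix of shape (d_model, vocab), splits its left singular vectors into equal-sized bands, and projects residual-stream vectors onto a chosen subset of bands. The random matrix, the equal band sizes, and the names `band_projectors` and `spectral_filter` are all illustrative assumptions; the sink-preserving filters are not modeled here.

```python
import numpy as np

def band_projectors(W_u, n_bands):
    """Build one orthogonal projector per spectral band.

    W_u: (d_model, vocab) matrix standing in for an unembedding matrix.
    Bands are contiguous, roughly equal-sized groups of left singular
    vectors, ordered from largest to smallest singular value
    (an assumption of this sketch).
    """
    # U has shape (d_model, r); singular values come back in descending order.
    U, _, _ = np.linalg.svd(W_u, full_matrices=False)
    index_bands = np.array_split(np.arange(U.shape[1]), n_bands)
    return [U[:, idx] @ U[:, idx].T for idx in index_bands]

def spectral_filter(x, projectors, keep):
    """Keep only the components of x that lie in the selected bands.

    x: (n_tokens, d_model) residual-stream vectors.
    keep: iterable of band indices to pass through; all others are suppressed.
    """
    d_model = x.shape[-1]
    P = np.zeros((d_model, d_model))
    for k in keep:
        P += projectors[k]
    # Each projector is symmetric, so right-multiplying rows of x is valid.
    return x @ P

# Toy stand-ins (hypothetical sizes, random data).
rng = np.random.default_rng(0)
d_model, vocab = 16, 50
W_u = rng.normal(size=(d_model, vocab))
projectors = band_projectors(W_u, n_bands=4)

x = rng.normal(size=(3, d_model))
head = spectral_filter(x, projectors, keep=[0])           # top of the spectrum
tail = spectral_filter(x, projectors, keep=[3])           # tail ("dark") band
full = spectral_filter(x, projectors, keep=[0, 1, 2, 3])  # all bands
```

Because the bands partition an orthonormal basis, the band projectors are mutually orthogonal and keeping every band recovers the input. In this toy setting, suppressing a band amounts to leaving its index out of `keep`.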
