Spectral Filters, Dark Signals, and Attention Sinks

Nicola Cancedda

TL;DR

This work extends the logit lens with logit spectroscopy: spectral filters defined over the singular vectors of the embedding and unembedding matrices, used to probe dark subspaces in large language models. It shows that signals in the tail of the spectrum govern attention sinking, and that substantial portions of the spectrum can be suppressed without harming next-token prediction, provided the sinking pathways are preserved. The author introduces sink-preserving filters (Ω) and demonstrates robust generation and sink dynamics across LLaMa2 models, linking dark signals to attention sinks such as the BoS token. The findings suggest avenues for spectral compression of the residual stream and motivate finer-grained analysis of dark subspaces and attention-bar phenomena, with implications for model transparency and safety.

Abstract

Projecting intermediate representations onto the vocabulary is an increasingly popular interpretation tool for transformer-based LLMs, also known as the logit lens. We propose a quantitative extension to this approach and define spectral filters on intermediate representations based on partitioning the singular vectors of the vocabulary embedding and unembedding matrices into bands. We find that the signals exchanged in the tail end of the spectrum are responsible for attention sinking (Xiao et al. 2023), for which we provide an explanation. We find that the loss of pretrained models can be kept low despite suppressing sizable parts of the embedding spectrum in a layer-dependent way, as long as attention sinking is preserved. Finally, we discover that the representations of tokens that draw attention from many tokens have large projections on the tail end of the spectrum.
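
As a rough illustration of what a spectral filter over bands of singular vectors could look like, the sketch below builds a projector from a slice of the right singular vectors of an unembedding matrix. It is a minimal sketch under stated assumptions, not the paper's implementation: the matrix is random rather than taken from a model, `W_u` is assumed to have shape (vocabulary, model dimension), the helper name `band_projector` is hypothetical, and the 5% cut-off for the dark band is borrowed from the Figure 2 caption.

```python
import numpy as np

def band_projector(W_u, start, stop):
    """Projector onto the span of right singular vectors start..stop-1 of W_u.

    Assumes W_u has shape (vocab, d), so right singular vectors live in R^d.
    (Hypothetical helper, for illustration only.)
    """
    # Rows of Vt are right singular vectors, sorted by decreasing singular value.
    _, _, Vt = np.linalg.svd(W_u, full_matrices=False)
    band = Vt[start:stop]      # shape (k, d)
    return band.T @ band       # shape (d, d), orthogonal projector

rng = np.random.default_rng(0)
vocab, d = 200, 40
W_u = rng.normal(size=(vocab, d))   # random stand-in for a real unembedding matrix

cut = int(round(0.95 * d))          # last 5% of singular vectors form the "dark" band
P_dark = band_projector(W_u, cut, d)
P_rest = band_projector(W_u, 0, cut)

x = rng.normal(size=d)              # stand-in for a residual-stream vector
x_dark = P_dark @ x                 # component in the tail of the spectrum
x_rest = P_rest @ x                 # component in the remaining bands
```

Because the two bands partition an orthonormal basis of the model space, `x_dark + x_rest` recovers `x`; suppressing a band amounts to dropping the corresponding term.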

Paper Structure (17 sections, 5 equations, 20 figures, 7 tables)

Figures (20)

  • Figure 1: Spectral filters project signals exchanged between components onto selected subspaces as defined by the spectral decomposition of the vocabulary embedding and unembedding matrices of the model.
  • Figure 2: Distribution of the singular values of the unembedding matrix $W_u$ of LLaMa2 13B. The U-Dark subspace is the one spanned by the last 5% right singular vectors.
  • Figure 3: The projections of four $W_o$ matrices of LLaMa2 70B on the RSVs of $W_u$. Different heads are equipped to write into different subspaces, with some targeting the dark subspace.
  • Figure 4: The projection of the rows of $W_2$ at L0 of LLaMa2 13B on the RSVs of $W_u$. Note the large values at the very right end of the spectrum, indicating the ability to write in the U-Dark space.
  • Figure 5: $\Psi$ filters project vectors onto subspaces that are dark according to both the embedding and the unembedding matrix decomposition.
  • ...and 15 more figures
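
The captions of Figures 3 and 4 describe projecting the rows of a weight matrix onto the right singular vectors (RSVs) of $W_u$. One way such a profile might be computed is sketched below. All data here is synthetic, the function name `rsv_profile` is hypothetical, and the toy matrix is constructed by hand to write only along the last RSV, so the result says nothing about any real model.

```python
import numpy as np

def rsv_profile(W, W_u):
    """Per-RSV magnitude of the rows of W, in the basis of W_u's right singular vectors.

    Assumes W has shape (rows, d) and W_u has shape (vocab, d).
    (Hypothetical helper, for illustration only.)
    """
    _, _, Vt = np.linalg.svd(W_u, full_matrices=False)
    coords = W @ Vt.T                       # coordinate of each row along each RSV
    return np.linalg.norm(coords, axis=0)   # one magnitude per RSV

rng = np.random.default_rng(1)
vocab, d, hidden = 300, 20, 50
W_u = rng.normal(size=(vocab, d))           # random stand-in for an unembedding matrix

# Synthetic weight matrix whose rows all lie along the last RSV of W_u,
# mimicking a component that writes only into the tail of the spectrum.
_, _, Vt = np.linalg.svd(W_u, full_matrices=False)
W_toy = np.outer(rng.normal(size=hidden), Vt[-1])

profile = rsv_profile(W_toy, W_u)
print(np.argmax(profile))                   # index of the RSV carrying the most mass
```

For this hand-built matrix the profile is concentrated on the final index, which is the kind of right-end peak the Figure 4 caption points to.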