Compressible Softmax-Attended Language under Incompressible Attention

Wonsuk Lee

Abstract

Across every attention head in five transformer language models (124M--7B parameters, four architecture families), the logit energy field $\tilde{E}$ reaches 90\% of its variance within 2--11 singular components. The \emph{learned} interaction matrix $W_Q^\mathrm{T} W_K$ needs 38--75 of its $d_h \in \{64, 128\}$ components to reach the same threshold, a $5$--$25\times$ gap in effective rank. The attention mechanism allocates capacity uniformly across all $d_h$ dimensions, but language concentrates the actual interaction into a few. The compressibility of softmax-attended language is a property of the data, not of the frame that analyzes it.
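
The rank comparison driving these numbers can be sketched in a few lines. The snippet below is illustrative only, not the paper's measurement code: the helper rank_for_variance, the toy matrices, and the dimensions are hypothetical stand-ins for one head's row-centered logit field $\tilde{E}$ and its learned interaction matrix $W_Q^\mathrm{T} W_K$.

    import numpy as np

    def rank_for_variance(M, threshold=0.90):
        # Smallest r such that the top-r singular components of M capture
        # `threshold` of the total squared singular-value mass (variance).
        s = np.linalg.svd(M, compute_uv=False)
        cum = np.cumsum(s**2) / np.sum(s**2)
        return int(np.searchsorted(cum, threshold)) + 1

    rng = np.random.default_rng(0)
    L, d_h = 512, 64

    # Toy stand-in for a head's logit field: low-rank by construction,
    # then row-centered as in the definition of E_tilde.
    E = rng.standard_normal((L, 5)) @ rng.standard_normal((5, L))
    E_tilde = E - E.mean(axis=1, keepdims=True)

    # Toy stand-in for the learned interaction matrix W_Q^T W_K: a generic
    # matrix, so its spectrum is spread across nearly all d_h directions.
    W_int = rng.standard_normal((d_h, d_h))

    print(rank_for_variance(E_tilde))  # few components: the field compresses
    print(rank_for_variance(W_int))    # large fraction of d_h: the frame does not

On real heads the paper reports 2--11 components for $\tilde{E}$ against 38--75 for the interaction matrix; the toy only reproduces the qualitative gap.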

Key Result

Proposition 4.1

Let $\tilde{E} \in \mathbb{R}^{L \times L}$ be the row-centered logit matrix with SVD $\tilde{E} = \sum_{k=1}^{R} \sigma_k\, u_k\, v_k^\mathrm{T}$, and let $\tilde{E}_r = \sum_{k=1}^{r} \sigma_k\, u_k\, v_k^\mathrm{T}$ be its rank-$r$ truncation. Let $p_i = \mathop{\mathrm{softmax}}\nolimits(\beta\, \tilde{E}_{i,:})$ and $p_i^{(r)} = \mathop{\mathrm{softmax}}\nolimits(\beta\, (\tilde{E}_r)_{i,:})$ for an inverse-temperature parameter $\beta > 0$. Then for every row $i$:
$$\bigl\lVert p_i - p_i^{(r)} \bigr\rVert_1 \;\le\; 2\beta\, \sigma_{r+1}.$$
The bound follows from the softmax Lipschitz bound (Theorem A.1) together with $\lVert \tilde{E}_{i,:} - (\tilde{E}_r)_{i,:} \rVert_\infty \le \lVert \tilde{E} - \tilde{E}_r \rVert_2 = \sigma_{r+1}$.
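
A quick numerical check makes the bound concrete. This is a sketch rather than the paper's code: the size $L$, rank $r$, temperature $\beta$, the random data, and the softmax helper are all illustrative assumptions, and the inequality checked is the reconstructed statement above.

    import numpy as np

    def softmax(x, beta=1.0):
        # Row-wise softmax with inverse temperature beta; subtracting the
        # row max is purely for numerical stability (softmax is shift-invariant).
        z = beta * (x - x.max(axis=-1, keepdims=True))
        e = np.exp(z)
        return e / e.sum(axis=-1, keepdims=True)

    rng = np.random.default_rng(1)
    L, r, beta = 256, 8, 1.0
    E = rng.standard_normal((L, L))
    E_tilde = E - E.mean(axis=1, keepdims=True)      # row-centered logit matrix

    U, s, Vt = np.linalg.svd(E_tilde, full_matrices=False)
    E_r = (U[:, :r] * s[:r]) @ Vt[:r]                # rank-r truncation

    p = softmax(E_tilde, beta)
    p_r = softmax(E_r, beta)

    lhs = np.abs(p - p_r).sum(axis=1).max()          # worst-case row l1 deviation
    rhs = 2 * beta * s[r]                            # 2*beta*sigma_{r+1} (0-indexed)
    assert lhs <= rhs
    print(f"max row deviation {lhs:.4f} <= bound {rhs:.4f}")

In practice the left side sits far below the bound, since the Lipschitz argument charges the full tail singular value $\sigma_{r+1}$ to every row.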

Theorems & Definitions

  • Proposition 4.1: Softmax stability of low-rank approximation (with proof)
  • Theorem A.1: Softmax Lipschitz bound (with proof)