Compressible Softmax-Attended Language under Incompressible Attention

Wonsuk Lee

Abstract

Across every attention head in five transformer language models (124M--7B parameters, four architecture families), the logit energy field $\tilde{E}$ reaches 90\% of its variance within 2--11 singular components. The \emph{learned} interaction matrix $W_Q^\mathrm{T} W_K$ needs 38--75 of its $d_h \in \{64, 128\}$ components to reach the same threshold, a $5$--$25\times$ gap in effective rank. The attention mechanism allocates capacity uniformly across all $d_h$ dimensions, but language concentrates the actual interaction into a few. The compressibility of softmax-attended language is a property of the data, not of the frame that analyzes it.
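
The rank comparison driving these numbers can be sketched in a few lines. The snippet below is illustrative only, not the paper's measurement code: the helper rank_for_variance, the toy matrices, and the dimensions are hypothetical stand-ins for one head's row-centered logit field $\tilde{E}$ and its learned interaction matrix $W_Q^\mathrm{T} W_K$.

    import numpy as np

    def rank_for_variance(M, threshold=0.90):
        # Smallest r such that the top-r singular components of M capture
        # `threshold` of the total squared singular-value mass (variance).
        s = np.linalg.svd(M, compute_uv=False)
        cum = np.cumsum(s**2) / np.sum(s**2)
        return int(np.searchsorted(cum, threshold)) + 1

    rng = np.random.default_rng(0)
    L, d_h = 512, 64

    # Toy stand-in for a head's logit field: low-rank by construction,
    # then row-centered as in the definition of E_tilde.
    E = rng.standard_normal((L, 5)) @ rng.standard_normal((5, L))
    E_tilde = E - E.mean(axis=1, keepdims=True)

    # Toy stand-in for the learned interaction matrix W_Q^T W_K: a generic
    # matrix, so its spectrum is spread across nearly all d_h directions.
    W_int = rng.standard_normal((d_h, d_h))

    print(rank_for_variance(E_tilde))  # few components: the field compresses
    print(rank_for_variance(W_int))    # large fraction of d_h: the frame does not

On real heads the paper reports 2--11 components for $\tilde{E}$ against 38--75 for the interaction matrix; the toy only reproduces the qualitative gap.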

Key Result

Proposition 4.1

Let $\tilde{E} \in \mathbb{R}^{L \times L}$ be the row-centered logit matrix with SVD $\tilde{E} = \sum_{k=1}^{R} \sigma_k\, u_k\, v_k^\mathrm{T}$, and let $\tilde{E}_r = \sum_{k=1}^{r} \sigma_k\, u_k\, v_k^\mathrm{T}$ be its rank-$r$ truncation. Let $p_i = \mathop{\mathrm{softmax}}\nolimits(\beta\, \tilde{E}_{i,:})$ and $p_i^{(r)} = \mathop{\mathrm{softmax}}\nolimits(\beta\, (\tilde{E}_r)_{i,:})$ for an inverse-temperature parameter $\beta > 0$. Then for every row $i$:
$$\bigl\lVert p_i - p_i^{(r)} \bigr\rVert_1 \;\le\; 2\beta\, \sigma_{r+1}.$$
The bound follows from the softmax Lipschitz bound (Theorem A.1) together with $\lVert \tilde{E}_{i,:} - (\tilde{E}_r)_{i,:} \rVert_\infty \le \lVert \tilde{E} - \tilde{E}_r \rVert_2 = \sigma_{r+1}$.
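
A quick numerical check makes the bound concrete. This is a sketch rather than the paper's code: the size $L$, rank $r$, temperature $\beta$, the random data, and the softmax helper are all illustrative assumptions, and the inequality checked is the reconstructed statement above.

    import numpy as np

    def softmax(x, beta=1.0):
        # Row-wise softmax with inverse temperature beta; subtracting the
        # row max is purely for numerical stability (softmax is shift-invariant).
        z = beta * (x - x.max(axis=-1, keepdims=True))
        e = np.exp(z)
        return e / e.sum(axis=-1, keepdims=True)

    rng = np.random.default_rng(1)
    L, r, beta = 256, 8, 1.0
    E = rng.standard_normal((L, L))
    E_tilde = E - E.mean(axis=1, keepdims=True)      # row-centered logit matrix

    U, s, Vt = np.linalg.svd(E_tilde, full_matrices=False)
    E_r = (U[:, :r] * s[:r]) @ Vt[:r]                # rank-r truncation

    p = softmax(E_tilde, beta)
    p_r = softmax(E_r, beta)

    lhs = np.abs(p - p_r).sum(axis=1).max()          # worst-case row l1 deviation
    rhs = 2 * beta * s[r]                            # 2*beta*sigma_{r+1} (0-indexed)
    assert lhs <= rhs
    print(f"max row deviation {lhs:.4f} <= bound {rhs:.4f}")

In practice the left side sits far below the bound, since the Lipschitz argument charges the full tail singular value $\sigma_{r+1}$ to every row.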

Theorems & Definitions

  • Proposition 4.1: Softmax stability of low-rank approximation (with proof)
  • Theorem A.1: Softmax Lipschitz bound (with proof)