How Many Features Can a Language Model Store Under the Linear Representation Hypothesis?

Nikhil Garg; Jon Kleinberg; Kenny Peng

TL;DR

The paper formalizes the linear representation hypothesis (LRH) by distinguishing linear representation (activations lie in a linear span of feature directions) from linear accessibility (features can be linearly decoded). It then analyzes how many features $m$ can be stored in a layer of $d$ neurons under sparsity, comparing classical compressed sensing (nonlinear decoding) to linear compressed sensing (linear decoding). The authors prove nearly matching bounds: $d = O_\epsilon(k^2\log m)$ suffices, while $d = \Omega_\epsilon\big(\frac{k^2}{\log k}\log\frac{m}{k}\big)$ is necessary (for certain ranges of $\epsilon$), highlighting a quantitative gap between the two notions and showing that linear accessibility is strictly stronger. The results demonstrate that, under reasonable sparsity, a single layer can store an exponential number of features relative to the number of neurons, supporting the superposition view of LRH, and they extend to decoders with activations and biases, as well as to binary/classification settings. The work thus provides a rigorous mathematical foundation for LRH, clarifies the geometry of feature directions, and suggests that LRH could meaningfully constrain and explain the representational capacity of language models across layers.
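To get a feel for how the three bounds compare, the following is a rough numerical sketch. It assumes every hidden constant equals 1 and ignores the dependence on $\epsilon$, neither of which the source specifies, so only the relative scaling is meaningful. The function names are illustrative, not taken from the paper.

```python
import math

# ASSUMPTION: all hidden constants are set to 1 and the epsilon dependence
# is dropped; these are scaling expressions, not actual neuron counts.

def classical_scale(k, m):
    """k * log(m/k): scaling for classical (nonlinear-decoding) compressed sensing."""
    return k * math.log(m / k)

def linear_lower_scale(k, m):
    """(k^2 / log k) * log(m/k): scaling of the linear-decoding lower bound (needs k >= 2)."""
    if k < 2:
        raise ValueError("log k must be positive, so k >= 2 is required")
    return (k * k / math.log(k)) * math.log(m / k)

def linear_upper_scale(k, m):
    """k^2 * log m: scaling of the linear-decoding upper bound."""
    return k * k * math.log(m)

def features_at_upper_scale(d, k):
    """Invert d = k^2 * log m to get m = exp(d / k^2), exponential in d for fixed k."""
    return math.exp(d / (k * k))

if __name__ == "__main__":
    k, m = 8, 10**6
    print("classical    :", round(classical_scale(k, m), 1))
    print("linear lower :", round(linear_lower_scale(k, m), 1))
    print("linear upper :", round(linear_upper_scale(k, m), 1))
```

Under these assumptions, the gap between the classical and linear settings is roughly a factor of $k/\log k$, while the number of storable features still grows exponentially in $d$ for fixed $k$.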

Abstract

We introduce a mathematical framework for the linear representation hypothesis (LRH), which asserts that intermediate layers of language models store features linearly. We separate the hypothesis into two claims: linear representation (features are linearly embedded in neuron activations) and linear accessibility (features can be linearly decoded). We then ask: How many neurons $d$ suffice to both linearly represent and linearly access $m$ features? Classical results in compressed sensing imply that for $k$-sparse inputs, $d = O(k\log (m/k))$ suffices if we allow non-linear decoding algorithms (Candes and Tao, 2006; Candes et al., 2006; Donoho, 2006). However, the additional requirement of linear decoding takes the problem out of classical compressed sensing and into linear compressed sensing. Our main theoretical result establishes nearly-matching upper and lower bounds for linear compressed sensing. We prove that $d = \Omega_\epsilon\big(\frac{k^2}{\log k}\log (m/k)\big)$ is required while $d = O_\epsilon(k^2\log m)$ suffices. The lower bound establishes a quantitative gap between classical and linear compressed sensing, illustrating how linear accessibility is a meaningfully stronger hypothesis than linear representation alone. The upper bound confirms that neurons can store an exponential number of features under the LRH, giving theoretical evidence for the "superposition hypothesis" (Elhage et al., 2022). The upper bound proof uses standard random constructions of matrices with approximately orthogonal columns. The lower bound proof uses rank bounds for near-identity matrices (Alon, 2003) together with Turán's theorem (bounding the number of edges in clique-free graphs). We also show how our results do and do not constrain the geometry of feature representations and extend our results to allow decoders with an activation function and bias.
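The random construction mentioned in the abstract can be illustrated with a small experiment. This is a minimal sketch under stated assumptions, not the paper's exact construction: columns are normalized Gaussian vectors, each probe is taken to be the feature direction itself ($b_i = a_i$), and the sizes `d`, `m`, `k` are arbitrary illustrative values. It uses only the Python standard library to stay dependency-free. With unit-norm columns of incoherence $\mu = \max_{i\ne j}|\langle a_i, a_j\rangle|$ and a $k$-sparse $z$ with entries in $[-1, 1]$, each decoded coordinate is off by at most $k\mu$.

```python
import math
import random

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def random_unit_columns(d, m, rng):
    """Return m random unit vectors in R^d (Gaussian entries, then normalized)."""
    cols = []
    for _ in range(m):
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(dot(v, v))
        cols.append([x / norm for x in v])
    return cols

def incoherence(cols):
    """Largest |<a_i, a_j>| over distinct columns."""
    return max(abs(dot(cols[i], cols[j]))
               for i in range(len(cols)) for j in range(i + 1, len(cols)))

def encode(cols, z):
    """Activation x = sum_i z_i a_i (zero coordinates are skipped)."""
    x = [0.0] * len(cols[0])
    for a, zi in zip(cols, z):
        if zi != 0.0:
            for t, at in enumerate(a):
                x[t] += zi * at
    return x

def linear_decode(cols, x):
    """ASSUMPTION: probes equal the feature directions, so z_hat_i = <a_i, x>."""
    return [dot(a, x) for a in cols]

rng = random.Random(0)          # fixed seed for reproducibility
d, m, k = 64, 128, 3            # illustrative sizes: more features than neurons
A = random_unit_columns(d, m, rng)
mu = incoherence(A)

z = [0.0] * m                   # a k-sparse feature vector with unit entries
for i in rng.sample(range(m), k):
    z[i] = 1.0

z_hat = linear_decode(A, encode(A, z))
max_err = max(abs(zh - zi) for zh, zi in zip(z_hat, z))
print(f"mu = {mu:.3f}, max error = {max_err:.3f}, bound k*mu = {k * mu:.3f}")
```

The worst-case bound $k\mu$ is loose for a random support; pushing it below a fixed $\epsilon$ requires $\mu \lesssim \epsilon/k$, which is where the $k^2$ factor in the upper bound comes from.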

Paper Structure (29 sections, 10 theorems, 61 equations, 1 figure)

Introduction
Mathematical Framework
Activations and features.
Feature accessibility.
The Linear Representation Hypothesis.
Superposition.
Results
Past Results: Compressed Sensing.
Our Results: Linear Compressed Sensing.
Further results.
Intuition and Proof Techniques
Upper bound sketch.
Initial attempt at lower bound.
Lower bound sketch.
Contributions: Implications for the Linear Representation Hypothesis and the Theory of Language Models
...and 14 more sections

Key Result

Theorem 1

There exists a matrix $A\in \mathbb{R}^{d\times m}$ with $d = O\left(k\log \frac{m}{k}\right)$ such that for all $k$-sparse $z\in \mathbb{R}^m$, satisfies

Figures (1)

Figure 1: Representation vectors $a_{\texttt{cat}}, a_{\texttt{happy}}$ and probe vectors $b_{\texttt{cat}}, b_{\texttt{happy}}$ that yield perfect linear recovery of the cat and happy features. Although the representation vectors are not orthogonal, the probe vectors extract the features perfectly, because the cat probe is orthogonal to the happy representation and the happy probe is orthogonal to the cat representation. Indeed, observe that $\langle b_{\texttt{cat}}, z_{\texttt{cat}}a_{\texttt{cat}} + z_{\texttt{happy}}a_{\texttt{happy}}\rangle = z_{\texttt{cat}}\langle b_{\texttt{cat}}, a_{\texttt{cat}}\rangle + z_{\texttt{happy}}\langle b_{\texttt{cat}}, a_{\texttt{happy}}\rangle = z_{\texttt{cat}}\cdot 1 + z_{\texttt{happy}}\cdot 0.$
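The identity in the caption can be checked numerically. The sketch below is illustrative only: the specific vectors `a_cat` and `a_happy` and the feature values are made-up assumptions (the figure's actual coordinates are not given in the text), and `dual_probes_2d` is a hypothetical helper that builds the probes as the rows of the inverse of the matrix whose columns are the two representation vectors.

```python
def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def dual_probes_2d(a1, a2):
    """Return probes (b1, b2) with <b_i, a_j> = 1 if i == j else 0.

    The probes are the rows of the inverse of the 2x2 matrix whose
    columns are a1 and a2, so a1 and a2 must be linearly independent.
    """
    det = a1[0] * a2[1] - a2[0] * a1[1]
    if abs(det) < 1e-12:
        raise ValueError("representation vectors must be linearly independent")
    b1 = (a2[1] / det, -a2[0] / det)
    b2 = (-a1[1] / det, a1[0] / det)
    return b1, b2

# ASSUMPTION: example directions chosen for illustration; they are unit-norm
# but deliberately not orthogonal (<a_cat, a_happy> = 0.6).
a_cat = (1.0, 0.0)
a_happy = (0.6, 0.8)
b_cat, b_happy = dual_probes_2d(a_cat, a_happy)

# An activation that mixes both features with arbitrary illustrative values.
z_cat, z_happy = 0.7, -1.5
x = tuple(z_cat * c + z_happy * h for c, h in zip(a_cat, a_happy))

print("recovered cat  :", dot(b_cat, x))
print("recovered happy:", dot(b_happy, x))
```

With only two features in two dimensions, exact recovery is always possible for linearly independent directions; the interesting regime in the paper is $m > d$, where such an exact dual basis cannot exist and recovery holds only approximately for sparse inputs.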

Theorems & Definitions (19)

Theorem 1: Compressed Sensing
Theorem 2: Upper Bound
Theorem 3: Lower Bound
Definition 4: $\mu$-incoherence
Lemma 5: Incoherence of Random Matrices
proof
Remark 6
Lemma 7: Alon (2003), Theorem 9.3
Corollary 8
proof
...and 9 more
