Why Linear Interpretability Works: Invariant Subspaces as a Result of Architectural Constraints

Andres Saurez, Yousung Lee, Dongsoo Har

TL;DR

This work explains why linear interpretability methods reliably uncover semantic structure in deep transformers by showing that any feature decoded through a linear interface must reside in a context-invariant subspace, a consequence of architectural constraints. It introduces the Self-Reference Property, which posits that class tokens themselves define the invariant feature directions, enabling zero-shot extraction of semantic directions and unsupervised probing. The authors validate their theory across eight semantic tasks and four model families using zero-shot probes, unsupervised transforms, and sparse autoencoders, all of which align with the same invariant directions. By unifying linear probes and sparse autoencoders under a principled geometric framework, the paper provides a concrete architectural explanation for the success of linear interpretability methods in transformers. The findings have practical implications for scalable, unsupervised circuit discovery and for evaluating representation dictionaries via token-derived directions.

Abstract

Linear probes and sparse autoencoders consistently recover meaningful structure from transformer representations, yet why should such simple methods succeed in deep, nonlinear systems? We show this is not merely an empirical regularity but a consequence of architectural necessity: transformers communicate information through linear interfaces (attention OV circuits, unembedding matrices), and any semantic feature decoded through such an interface must occupy a context-invariant linear subspace. We formalize this as the Invariant Subspace Necessity theorem and derive the Self-Reference Property: tokens directly provide the geometric direction for their associated features, enabling zero-shot identification of semantic structure without labeled data or learned probes. Empirical validation across eight classification tasks and four model families confirms the alignment between class tokens and semantically related instances. Our framework provides a principled architectural explanation for why linear interpretability methods work, unifying linear probes and sparse autoencoders.
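
To make the Self-Reference Property concrete, the following is a minimal sketch of a zero-shot probe: each class token's vector is used directly as the feature direction, and a contextual representation is assigned to the class whose token it aligns with best under cosine similarity. The vectors, the "Japan" and "Kyoto" examples, and the function names are hypothetical stand-ins for illustration; they are not activations from any model in the paper, and the paper's actual extraction procedure may differ.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_classify(hidden, class_token_vectors):
    """Return the class whose token vector is most aligned with `hidden`."""
    return max(
        class_token_vectors,
        key=lambda name: cosine(hidden, class_token_vectors[name]),
    )


# Hypothetical toy vectors standing in for real model activations.
class_token_vectors = {
    "France": [1.0, 0.0, 0.1],
    "Japan": [0.0, 1.0, 0.1],
}
contextual_mentions = {
    "I went to Paris": [0.9, 0.2, 0.4],
    "I visited Kyoto": [0.1, 0.8, 0.5],
}

# No labels and no trained probe: the class token itself is the direction.
predictions = {
    text: zero_shot_classify(vec, class_token_vectors)
    for text, vec in contextual_mentions.items()
}
print(predictions)
```

In this toy setup no probe weights are learned; the only inputs are the class-token vectors and the representations to be classified, which is the sense in which the probe is zero-shot.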

Paper Structure (59 sections, 12 theorems, 32 equations, 13 figures, 2 tables)

Key Result

Theorem 3.7

Let $\mathcal{M}$ be a transformer satisfying the paper's architectural assumption, and let $f$ be a communicable feature decoded through a linear interface $W$. Then there exists a context-invariant subspace $\mathcal{S}_f \subseteq \mathbb{R}^d$ such that the $f$-relevant component of $\mathbf{h}(c)$ lies

Figures (13)

  • Figure 1: Context-invariant directional representation. The explicit token “France” provides a reference vector for this direction (self-reference), while contextual mentions such as “I went to Paris” and “I visited Marseille” share the same invariant direction.
  • Figure 2: Linear readout layers constrain representation geometry. We train a transformer on modular division with an MLP classification head instead of linear unembedding. (Left) When the model finds a non-Fourier solution, embeddings lack circular structure and linear probes fail ($\sim$20% accuracy). (Right) When the model discovers Fourier structure, linear probes succeed. Across random seeds, linear probe accuracy correlates with the emergence of Fourier representations, which are not required by MLP heads. Linear readout interfaces would necessitate such directional structure.
  • Figure 3: Token alignment validation of the Self-Reference Property across four datasets in LLaMA3-8B. Each point represents one attention head; the x-axis shows mean cosine similarity between class tokens and other-class implicit instances, while the y-axis shows similarity to same-class implicit instances. Points above the diagonal indicate stronger alignment with the correct class. Percentages indicate heads above diagonal: Countries 91.5%, Animals 97.6%, Cartoon Characters 86.0%, Emotions 89.6%.
  • Figure 4: SAE shared peak analysis across classes. We compare top-$k$ SAE dimensions of a class token with top-$k$ SAE dimensions derived from its instances. Red markers denote shared dimensions, revealing shared invariant features between tokens and contexts.
  • Figure 5: PCA and t-SNE projections of embeddings for the polysemous word Apple (fruit vs. company) using domain-specific tokens. (Left) A single "Apple" token representing both classes (69% accuracy). (Right) Separate tokens "Fruit apple" and "Company Apple" treated as distinct classes (65.7% accuracy). Both methods successfully disentangle the two senses, with instances clustering around their respective class prototypes.
  • ...and 8 more figures
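
The per-head summary reported in Figure 3 (the percentage of heads above the diagonal) can be sketched as follows. This is an illustrative reading of the caption rather than the authors' evaluation code: each head is assumed to be summarized by a pair of mean cosine similarities (class token vs. other-class instances, class token vs. same-class instances), and the example scores are made up.

```python
def fraction_above_diagonal(head_scores):
    """Fraction of heads whose same-class similarity exceeds other-class similarity.

    head_scores: list of (other_class_sim, same_class_sim) pairs, one per head,
    matching the (x, y) axes described in the Figure 3 caption.
    """
    above = sum(1 for other, same in head_scores if same > other)
    return above / len(head_scores)


# Hypothetical per-head scores, not values from the paper.
head_scores = [(0.10, 0.35), (0.22, 0.41), (0.30, 0.28), (0.05, 0.19)]

fraction = fraction_above_diagonal(head_scores)
print(f"{100 * fraction:.1f}% of heads above diagonal")  # prints "75.0% of heads above diagonal"
```

A head counts as "above the diagonal" only when its same-class similarity is strictly larger, so ties are treated as not aligned in this sketch.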

Theorems & Definitions (33)

  • Definition 3.2: Context
  • Definition 3.3: Semantic Feature
  • Definition 3.4: Communicable Feature
  • Definition 3.5: Invariant Subspace
  • Definition 3.6: Directional Invariance
  • Theorem 3.7: Invariant Subspace Necessity
  • Proof sketch
  • Proposition 3.8: Capacity Constraint Implies Feature Sharing
  • Proof sketch
  • Remark 3.9: Implicit Classification Revisited
  • ...and 23 more