From Data Statistics to Feature Geometry: How Correlations Shape Superposition

Lucas Prieto, Edward Stevinson, Melih Barsbey, Tolga Birdal, Pedro A. M. Mediano

TL;DR

Using BOWS, a controlled setting for encoding binary bag-of-words representations of internet text in superposition, we find that when features are correlated, interference can be constructive rather than just noise to be filtered out, and that this constructive arrangement is more prevalent in models trained with weight decay.

Abstract

A central idea in mechanistic interpretability is that neural networks represent more features than they have dimensions, arranging them in superposition to form an over-complete basis. This framing has been influential, motivating dictionary learning approaches such as sparse autoencoders. However, superposition has mostly been studied in idealized settings where features are sparse and uncorrelated. In these settings, superposition is typically understood as introducing interference that must be minimized geometrically and filtered out by non-linearities such as ReLUs, yielding local structures like regular polytopes. We show that this account is incomplete for realistic data by introducing Bag-of-Words Superposition (BOWS), a controlled setting to encode binary bag-of-words representations of internet text in superposition. Using BOWS, we find that when features are correlated, interference can be constructive rather than just noise to be filtered out. This is achieved by arranging features according to their co-activation patterns, making interference between active features constructive, while still using ReLUs to avoid false positives. We show that this kind of arrangement is more prevalent in models trained with weight decay and naturally gives rise to semantic clusters and cyclical structures which have been observed in real language models yet were not explained by the standard picture of superposition. Code for this paper can be found at https://github.com/LucasPrietoAl/correlations-feature-geometry.
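
As a rough illustration of the input format described above, the following sketch builds binary bag-of-words vectors and counts feature co-activations, the statistic that correlated features would be arranged around. The vocabulary, the example sentences, and the helper names `bag_of_words` and `co_activation_counts` are hypothetical and are not taken from the paper's code; tokenization is a plain whitespace split chosen for brevity.

```python
# Hypothetical vocabulary and documents, for illustration only.
VOCAB = ["december", "christmas", "summer", "july", "beatles"]


def bag_of_words(text, vocab=VOCAB):
    """Return a binary vector: 1 if the vocabulary word occurs in the text, else 0."""
    tokens = set(text.lower().split())  # naive whitespace tokenization (assumption)
    return [1 if word in tokens else 0 for word in vocab]


def co_activation_counts(vectors):
    """Count how often each pair of features is active in the same sample."""
    d = len(vectors[0])
    counts = [[0] * d for _ in range(d)]
    for x in vectors:
        for i in range(d):
            for j in range(d):
                counts[i][j] += x[i] * x[j]
    return counts


docs = [
    "Christmas falls in December",
    "December snow before Christmas",
    "July is the peak of summer",
    "The Beatles released an album in July",
]
X = [bag_of_words(doc) for doc in docs]  # one binary vector per document
C = co_activation_counts(X)              # C[i][j]: samples where i and j co-occur
```

In this toy corpus, 'december' and 'christmas' are always active together while 'december' and 'july' never are, which is the kind of correlation structure the abstract refers to.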

Paper Structure (55 sections, 13 equations, 19 figures, 1 table)

Figures (19)

  • Figure 1: BOWS, our new framework to study superposition in realistic data (left), extends our current understanding of superposition (middle) by showing that interference can be constructive, allowing words like 'December' to contribute to the reconstruction of correlated words like 'Christmas', giving rise to a circular arrangement for the months of the year (right).
  • Figure 2: Autoencoding synthetic correlated features shows two ways of handling interference. Weight inner products ($\mathbf{W}^\top \mathbf{W}$) at convergence for AEs encoding $d=12$ features with cyclic covariance, varying latent size $m$. Top (Linear AE): captures the top-$m$ principal components, projecting all 12 features onto the circular structure induced by the data covariance. Bottom (ReLU AE): matches the linear AE for small $m$, but forms antipodal pairs for $m\in\{6,\dots,10\}$, using the ReLU to filter interference.
  • Figure 3: Linear superposition appears in ReLU AEs which have small latent sizes (a) or are trained with weight decay (d), giving rise to semantic clusters. UMAP projections of word embeddings from AEs of different latent dimensions ($m$) and weight decay values ($wd$). Points are colored by semantic category (e).
  • Figure 4: Circular representation of months arises from data covariance via PCA. (a) Empirical correlation matrix of month words in the WikiText-103 BOWS dataset, showing cyclic correlations. (b) PCA applied directly to the 12 month dimensions of the BOWS data vectors, projected onto the top 2 PCs, reveals a circle. (c) PCA applied to the 12 learned encoder features ($W$ columns) for months from a ReLU AE trained on WikiText-BOWS ($V=10k$, $m=1000$), projected onto their top 2 PCs, also recovers the circular structure. Seasonal words such as 'Christmas' and 'summer' align with the months with which they co-occur, allowing 'December' to contribute to the reconstruction of 'Christmas' while interference on 'Christmas' cancels out if all months are present (d).
  • Figure 5: Constructive interference and interference filtering coexist in realistic data. (Left) Terms related to 'Beatles' achieve high validation $R^2$ despite poor one-hot reconstruction, indicating that they benefit from contextual interference. (Middle) For 81% of validation samples containing 'Beatles', interference improves reconstruction relative to the one-hot case. (Right) In supportive contexts, correlated words contribute positive pre-activation to 'Beatles'; when these contexts occur without the target word, the ReLU and negative bias suppress false positives.
  • ...and 14 more figures
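
The circular arrangement described in the captions of Figures 1 and 4 can be illustrated with a toy calculation. The sketch below is not the paper's trained model: it assumes a two-dimensional latent space, places the twelve months at equal angles on a unit circle, gives 'Christmas' the same direction as 'December', and uses tied weights with an optional bias. All of these choices, and the helper names `encode` and `pre_activation`, are assumptions made for illustration.

```python
import math

MONTHS = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]

# Assumed geometry: month i sits at angle 2*pi*i/12 on a unit circle (m = 2).
W = {
    name: (math.cos(2 * math.pi * i / 12), math.sin(2 * math.pi * i / 12))
    for i, name in enumerate(MONTHS)
}
# Hypothetical choice: 'Christmas' shares December's direction.
W["Christmas"] = W["December"]


def encode(active_words):
    """Latent vector: sum of the directions of the active (binary) features."""
    return (
        sum(W[w][0] for w in active_words),
        sum(W[w][1] for w in active_words),
    )


def pre_activation(target, active_words, bias=0.0):
    """Decoder pre-activation for `target` with tied weights: w_target . h + bias."""
    h = encode(active_words)
    return W[target][0] * h[0] + W[target][1] * h[1] + bias


# 'Christmas' alone: only its own contribution, no interference.
alone = pre_activation("Christmas", ["Christmas"])

# 'Christmas' with 'December': interference from December adds to the target.
with_december = pre_activation("Christmas", ["Christmas", "December"])

# 'Christmas' with all twelve months: the month directions sum to zero,
# so their interference on 'Christmas' cancels.
with_all_months = pre_activation("Christmas", ["Christmas"] + MONTHS)
```

Under these assumptions, 'December' raises the pre-activation of 'Christmas' above its one-hot value, while the twelve months together contribute nothing because equally spaced unit vectors on a circle sum to zero. The sketch does not attempt to reproduce the false-positive suppression by the ReLU and negative bias described in Figure 5.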

Theorems & Definitions (4)

  • Definition 1: Superposition
  • Definition 2: Linear Superposition
  • Definition 3: Non-linear Superposition
  • Definition 4: Linear Representation Hypothesis