Mathematical Models of Computation in Superposition

Kaarel Hänni; Jake Mendel; Dmitry Vaintrob; Lawrence Chan

TL;DR

This work develops a mathematical framework for computation in superposition, where a neural network represents more features than it has dimensions. It introduces the Universal AND (U-AND) circuit as a canonical testbed and shows that a single-layer MLP can $\varepsilon$-linearly represent all $\binom{m}{2}$ pairwise ANDs of $m$ sparse boolean features using only $\tilde{O}(m^{\frac{2}{3}})$ neurons, even when the inputs are themselves in superposition. The authors extend the construction to arbitrary sparse boolean circuits of low depth and introduce error-correction layers that let deep fully-connected networks of width $d$ emulate circuits of width $\tilde{O}(d^{1.5})$ and any polynomial depth, under sparsity and mild randomness assumptions. These results show how computation can be distributed across high-dimensional activation spaces, with implications for mechanistic interpretability and the analysis of polysemantic representations in real models.

Abstract

Superposition -- when a neural network represents more ``features'' than it has dimensions -- seems to pose a serious challenge to mechanistically interpreting current AI systems. Existing theory work studies \emph{representational} superposition, where superposition is only used when passing information through bottlenecks. In this work, we present mathematical models of \emph{computation} in superposition, where superposition is actively helpful for efficiently accomplishing the task. We first construct a task of efficiently emulating a circuit that computes the AND of each of the $\binom{m}{2}$ pairs of $m$ features. We construct a 1-layer MLP that uses superposition to perform this task up to $\varepsilon$-error, where the network only requires $\tilde{O}(m^{\frac{2}{3}})$ neurons, even when the input features are \emph{themselves in superposition}. We generalize this construction to arbitrary sparse boolean circuits of low depth, and then construct ``error correction'' layers that allow deep fully-connected networks of width $d$ to emulate circuits of width $\tilde{O}(d^{1.5})$ and \emph{any} polynomial depth. We conclude by providing some potential applications of our work for interpreting neural networks that implement computation in superposition.

Paper Structure (36 sections, 28 theorems, 97 equations, 4 figures, 3 tables)

Introduction
Background and setup
Notation and conventions
Asymptotic complexity and $\tilde{O}$ notation
Fully connected neural networks
Features and feature vectors
Sparse boolean circuits
Strong and weak linear representations
Comparison with Anthropic's Toy Model of Superposition
Universal ANDs: a model of single-layer MLP superposition
Superposition in MLP activations enables more efficient U-AND
Neural networks can implement efficient U-AND even with inputs in superposition
Randomly initialized neural networks linearly represent U-AND
MLPs as representing sparse boolean circuits
Boolean circuits in single layer MLPs
...and 21 more sections

Key Result

Theorem 1

Fix a sparsity parameter $s\in \mathbb{N}.$ Then for large input length $m$, there exists a single-layer neural network ${\mathcal{M}_{w}}(x) = \mathrm{MLP}(x) = \mathrm{ReLU}({W_{\textrm{in}}} x + {w_{\textrm{bias}}})$ that $\varepsilon$-linearly represents the universal AND circuit $\mathcal{C}_

Figures (4)

Figure 1: The naive way to linearly represent the pairwise ANDs of $m$ boolean variables using an MLP is to use one neuron to compute the AND of each pair of variables (left). This requires $\binom{m}{2} = O(m^2)$ neurons. However, when inputs are sparse, there is a much more efficient implementation using superposition (right). Here, each neuron checks whether at least two variables are active in a random subset of the variables. Then, for any pair of variables, we can read off the AND of that pair by averaging together the activations of all neurons corresponding to the subsets containing both variables. With appropriately chosen subsets, we can $\varepsilon$-linearly represent all pairwise ANDs using only $\tilde{O}(m^{\frac{2}{3}})$ neurons, even when the inputs are themselves represented in superposition (Section \ref{sec:one-layer-mlp-u-and}).
Figure 2: In Section \ref{sec:linear-representations}, we distinguish between boolean features that are $\varepsilon$-linearly represented (left), $\mathrm{ReLU}$-linearly represented (center left), and those that are only linearly separable (i.e. weakly linearly represented) (center right). Red/blue indicates the presence or absence of the feature. In addition to being linearly separable, $\varepsilon$-linearly represented features must satisfy the further condition that the variance in the readoff direction $\vec{r}_k$ within the positive and negative clusters is small compared to the margin between the two.
Figure 3: When two features $f_1, f_2$ are $\varepsilon$-linearly represented in activations $a(x)$, we can use two MLP neurons with input weights $\vec{r}_1, \vec{r}_2$ to read-off the two features, after which $f_1 \land f_2$ and $f_1 \lor f_2$ are $\varepsilon$-linearly represented in the MLP activations $\mathrm{MLP}(a(x))$. However, because linearly-separable features can have arbitrarily small margin, there might exist no MLP such that $f_1 \land f_2$ and $f_1 \lor f_2$ are linearly separable in $\mathrm{MLP}(a(x))$.
Figure 4: As discussed in Section \ref{sec:sparse-circuit-single-mlp}, our U-AND construction can be extended to allow for arbitrarily high fan-in ANDs, which in turn allows for single-layer MLPs that linearly represent all small boolean circuits.
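
The superposition readout described in the Figure 1 caption can be made concrete with a small simulation. The following is a minimal sketch, not the paper's exact construction: the helper names (`build_subsets`, `mlp_activations`, `read_and`), the independent inclusion probability `P`, and the sizes `M` and `D` are all illustrative assumptions. Each neuron computes $\mathrm{ReLU}$ of (number of active inputs in its subset, minus one), and the AND of a pair is read off by averaging over the neurons whose subsets contain both inputs.

```python
import random

def build_subsets(m, d, p, rng):
    """Neuron i watches a random subset S_i; each input joins with probability p."""
    return [{j for j in range(m) if rng.random() < p} for _ in range(d)]

def mlp_activations(subsets, active):
    """ReLU(|S_i & active| - 1): positive only if at least two watched inputs are on."""
    return [max(0, len(s & active) - 1) for s in subsets]

def read_and(subsets, acts, j, k):
    """Average the activations of all neurons whose subset contains both j and k."""
    idx = [i for i, s in enumerate(subsets) if j in s and k in s]
    if not idx:
        raise ValueError("no neuron watches both inputs; increase d or p")
    return sum(acts[i] for i in idx) / len(idx)

# Illustrative sizes (assumptions, not taken from the paper):
# 300 inputs, 8000 neurons, inclusion probability 0.05.
M, D, P = 300, 8000, 0.05
rng = random.Random(0)
subsets = build_subsets(M, D, P, rng)

# Sparse input: only x_3 and x_7 are on.
acts = mlp_activations(subsets, active={3, 7})
print("readout for AND(x_3, x_7):", read_and(subsets, acts, 3, 7))
print("readout for AND(x_3, x_8):", read_and(subsets, acts, 3, 8))
```

With these toy sizes the network uses 8000 neurons where the naive one-neuron-per-pair layout would need $\binom{300}{2} = 44850$. When exactly the queried pair is active, every contributing neuron outputs 1, so the readout is exactly 1. When only one of the two queried inputs is active, the readout is an interference term: the fraction of contributing neurons that also happen to watch another active input, which is about `P` on average. Keeping that interference below $\varepsilon$ is what constrains the choice of subsets in the caption's "appropriately chosen subsets".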

Theorems & Definitions (62)

Definition 1: Weak linear representations
Definition 2: $\varepsilon$-linear representations
Definition 3: ReLU-linear representations
Definition 4: The universal AND boolean circuit
Theorem 1: U-AND with basis-aligned inputs
proof
Theorem 2: U-AND with inputs in superposition
proof
Theorem 3: Randomly initialized MLPs linearly represent U-AND
proof
...and 52 more
