Mathematical Models of Computation in Superposition
Kaarel Hänni, Jake Mendel, Dmitry Vaintrob, Lawrence Chan
TL;DR
This work develops a mathematical framework for computation in superposition, where a neural network computes with more features than it has neurons. It introduces the Universal AND (U-AND) circuit as a canonical testbed and shows that a single-layer MLP can $\varepsilon$-linearly represent all $\binom{m}{2}$ pairwise ANDs of $m$ features using only $\tilde{O}(m^{2/3})$ neurons, even when the inputs are themselves in superposition. The authors extend the construction to arbitrary sparse Boolean circuits of low depth and introduce error-correction layers that let a deep fully-connected network of width $d$ emulate circuits of width $\tilde{O}(d^{1.5})$ (equivalently, a circuit of width $m$ needs network width of about $m^{2/3}$) at any polynomial depth, under sparsity and mild randomness assumptions. Together, these results show how computation can be distributed across high-dimensional activation spaces, with implications for mechanistic interpretability and the analysis of polysemantic representations in real models.
Abstract
Superposition -- when a neural network represents more ``features'' than it has dimensions -- seems to pose a serious challenge to mechanistically interpreting current AI systems. Existing theory work studies \emph{representational} superposition, where superposition is only used when passing information through bottlenecks. In this work, we present mathematical models of \emph{computation} in superposition, where superposition is actively helpful for efficiently accomplishing the task. We first pose the task of efficiently emulating a circuit that computes the AND of each of the $\binom{m}{2}$ pairs of $m$ features. We construct a 1-layer MLP that uses superposition to perform this task up to $\varepsilon$-error, where the network only requires $\tilde{O}(m^{\frac{2}{3}})$ neurons, even when the input features are \emph{themselves in superposition}. We generalize this construction to arbitrary sparse boolean circuits of low depth, and then construct ``error correction'' layers that allow deep fully-connected networks of width $d$ to emulate circuits of width $\tilde{O}(d^{1.5})$ and \emph{any} polynomial depth. We conclude by providing some potential applications of our work for interpreting neural networks that implement computation in superposition.
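
To make the pairwise-AND task concrete, the following is a minimal NumPy sketch of one way a single ReLU layer with fewer neurons than feature pairs can expose every pairwise AND to a linear readout. The abstract does not spell out the construction, so everything specific here is an assumption made for illustration: the random 0/1 connectivity with probability `p`, the bias of $-1$, the averaging readout, the helper names (`build_uand`, `hidden`, `read_all_ands`), and the sizes `m`, `d`, `p`. For simplicity the inputs are basis-aligned rather than themselves in superposition, and the sizes are not tuned to the $\tilde{O}(m^{\frac{2}{3}})$ bound.

```python
import numpy as np

def build_uand(m, d, p, seed=0):
    """Random 0/1 weights (assumed construction): neuron k reads feature i with probability p."""
    rng = np.random.default_rng(seed)
    return (rng.random((d, m)) < p).astype(np.float64)

def hidden(W, x):
    """One ReLU layer with bias -1: a neuron fires only if at least two of its inputs are on."""
    return np.maximum(W @ x - 1.0, 0.0)

def read_all_ands(W, h):
    """Linear readout: for each pair (i, j), average h over neurons connected to both i and j."""
    counts = W.T @ W                    # counts[i, j] = number of neurons reading both i and j
    sums = W.T @ (W * h[:, None])       # sums[i, j]   = total activation of those neurons
    return sums / np.maximum(counts, 1.0)

# Hypothetical sizes: 100 features give 4950 pairs, but the layer has only 2000 neurons.
m, d, p = 100, 2000, 0.15
W = build_uand(m, d, p)

# Sparse boolean input: only features 3 and 7 are active.
x = np.zeros(m)
x[[3, 7]] = 1.0

est = read_all_ands(W, hidden(W, x))    # est[i, j] approximates x[i] AND x[j] for i != j
print("readout for the active pair (3, 7):", est[3, 7])
print("largest readout among the other pairs:",
      np.max(np.where(np.outer(x, x) > 0, 0.0, np.triu(est, k=1))))
```

In this sketch the readout for the active pair is exact, because every neuron connected to both active features has pre-activation $2 - 1 = 1$. Readouts for other pairs pick up interference of roughly `p` when the pair shares one active feature and roughly `p**2` otherwise, which plays the role of the $\varepsilon$-error in the abstract. Keeping that interference small as more features become active is what the sparsity assumption is for.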
