From superposition to sparse codes: interpretable representations in neural networks
David Klindt, Charles O'Neill, Patrik Reizinger, Harald Maurer, Nina Miolane
TL;DR
The paper asks how neural networks encode information when many concepts share a single distributed vector, and proposes that representations exhibit superposition, i.e., concepts combine linearly (additively) in the latent space. It develops a three-step framework: (1) identifiability theory showing that supervised learning recovers latent features up to a linear transformation, (2) sparse coding and compressed sensing to lift sparse, interpretable factors out of superposed activations, and (3) quantitative interpretability metrics to evaluate the recovered features. A mathematical world model with latent variables $z$, data $x=g(z)$, and representation $y=f(x)$ shows that the composite map $h=f\circ g$ is linear and invertible under reasonable assumptions, enabling sparse decoding when $M\ll N$. The authors connect the theory to prior work in AI and neuroscience, discuss sparse autoencoders and interpretability evaluations such as the word intrusion task, and argue that the framework supports AI transparency and a deeper understanding of neural coding. Practical implications include guidance for designing interpretable representations, scalable inference for sparse codes, and robust interpretability evaluations across artificial and biological systems.
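The world model in this summary can be written out compactly. The block below is a sketch under stated assumptions, not the authors' exact formulation: it assumes $z\in\mathbb{R}^N$ holds the sparse latent variables, $y\in\mathbb{R}^M$ holds the network activations, and $A$ is a hypothetical name for the matrix of the linear map $h$. The last line is the standard basis pursuit program from compressed sensing, which is one common way to pose sparse decoding; the summary does not say which decoder the authors use.

```latex
% Assumed notation (not fixed by the summary):
%   z \in \mathbb{R}^N : sparse latent variables
%   y \in \mathbb{R}^M : network activations, with M \ll N
%   A \in \mathbb{R}^{M \times N} : matrix of the linear map h
\begin{align}
  x &= g(z), \qquad y = f(x) = (f \circ g)(z) = h(z), \\
  h(z) &= A z \qquad \text{(superposition: features add linearly)}, \\
  \hat{z} &= \arg\min_{z'} \lVert z' \rVert_1
            \quad \text{subject to} \quad y = A z'.
\end{align}
```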
Abstract
Understanding how information is represented in neural networks is a fundamental challenge in both neuroscience and artificial intelligence. Despite their nonlinear architectures, recent evidence suggests that neural networks encode features in superposition, meaning that input concepts are linearly overlaid within the network's representations. We present a perspective that explains this phenomenon and provides a foundation for extracting interpretable representations from neural activations. Our theoretical framework consists of three steps: (1) Identifiability theory shows that neural networks trained for classification recover latent features up to a linear transformation. (2) Sparse coding methods can extract disentangled features from these representations by leveraging principles from compressed sensing. (3) Quantitative interpretability metrics provide a means to assess the success of these methods, ensuring that extracted features align with human-interpretable concepts. By bridging insights from theoretical neuroscience, representation learning, and interpretability research, we propose an emerging perspective on understanding neural representations in both artificial and biological systems. Our arguments have implications for neural coding theories, AI transparency, and the broader goal of making deep learning models more interpretable.
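Step (2) can be illustrated numerically with a minimal sketch. It is not the authors' method: it assumes a hypothetical random Gaussian matrix `A` that mixes `N` sparse latent features into `M` activations, with `M` much smaller than `N`, and it uses orthogonal matching pursuit, a classical greedy sparse-recovery algorithm from compressed sensing, as a stand-in for whatever sparse coding method is applied in practice. All names and sizes (`omp`, `A`, `N`, `M`, `k`) are chosen for illustration only.

```python
import numpy as np


def omp(A, y, k):
    """Orthogonal matching pursuit: greedily select k columns of A to explain y."""
    n_features = A.shape[1]
    support = []
    residual = y.copy()
    coef = np.zeros(0)
    for _ in range(k):
        # Score each column by its correlation with the current residual.
        scores = np.abs(A.T @ residual)
        if support:
            scores[support] = 0.0  # never select the same column twice
        support.append(int(np.argmax(scores)))
        # Refit all selected coefficients jointly by least squares.
        coef, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        residual = y - A[:, support] @ coef
    z_hat = np.zeros(n_features)
    z_hat[support] = coef
    return z_hat


rng = np.random.default_rng(0)
N, M, k = 1024, 200, 3  # latent features, activation dimensions, active features

# Hypothetical mixing matrix: each latent feature gets a random direction in R^M.
A = rng.standard_normal((M, N)) / np.sqrt(M)

# A k-sparse latent vector with nonzero magnitudes in [1, 2] and random signs.
true_support = rng.choice(N, size=k, replace=False)
z = np.zeros(N)
z[true_support] = rng.choice([-1.0, 1.0], size=k) * rng.uniform(1.0, 2.0, size=k)

# Superposed activation: a linear overlay of the k active feature directions.
y = A @ z

z_hat = omp(A, y, k)
print("true support:     ", sorted(true_support.tolist()))
print("recovered support:", sorted(np.flatnonzero(z_hat).tolist()))
print("max abs error:    ", float(np.max(np.abs(z_hat - z))))
```

The sketch assumes the mixing matrix and the sparsity level `k` are known. In the interpretability setting neither is given, which is why dictionary-learning methods such as sparse autoencoders have to learn the feature directions from activations alone.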
