From superposition to sparse codes: interpretable representations in neural networks

David Klindt, Charles O'Neill, Patrik Reizinger, Harald Maurer, Nina Miolane

TL;DR

The paper addresses how neural networks encode information when multiple concepts are represented in a single distributed vector by proposing that representations exhibit superposition, i.e., linear additivity in the latent space. It develops a three-step framework: (1) identifiability theory showing latent features can be recovered up to a linear transformation from supervised learning, (2) sparse coding and compressed sensing to lift sparse, interpretable factors from superposed activations, and (3) quantitative interpretability metrics to evaluate recovered features. A mathematical world model with latent variables $z$, data $x=g(z)$, and representation $y=f(x)$ shows that the composite map $h=f \circ g$ is linear and invertible under reasonable assumptions, enabling sparse decoding when $M \ll N$. The authors connect theory to prior AI and neuroscience work, discuss sparse autoencoders and interpretability tasks like Word Intrusion Tasks, and argue that this framework supports AI transparency and a deeper understanding of neural coding. Practical implications include guiding the design of interpretable representations, scalable inference for sparse codes, and robust interpretability evaluations across artificial and biological systems.

Abstract

Understanding how information is represented in neural networks is a fundamental challenge in both neuroscience and artificial intelligence. Despite their nonlinear architectures, recent evidence suggests that neural networks encode features in superposition, meaning that input concepts are linearly overlaid within the network's representations. We present a perspective that explains this phenomenon and provides a foundation for extracting interpretable representations from neural activations. Our theoretical framework consists of three steps: (1) Identifiability theory shows that neural networks trained for classification recover latent features up to a linear transformation. (2) Sparse coding methods can extract disentangled features from these representations by leveraging principles from compressed sensing. (3) Quantitative interpretability metrics provide a means to assess the success of these methods, ensuring that extracted features align with human-interpretable concepts. By bridging insights from theoretical neuroscience, representation learning, and interpretability research, we propose an emerging perspective on understanding neural representations in both artificial and biological systems. Our arguments have implications for neural coding theories, AI transparency, and the broader goal of making deep learning models more interpretable.
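Step (2) of the framework can be illustrated with a minimal sketch. It assumes a synthetic setting that is not taken from the paper: a sparse latent vector `z` in $\mathbb{R}^N$, a known random linear map `A` standing in for $h$, and a superposed representation `y = A @ z` in $\mathbb{R}^M$ with $M < N$. The solver is plain ISTA (iterative soft-thresholding) for the L1-regularized least-squares problem, chosen here for brevity; sparse coding in practice also has to learn the dictionary, which this sketch skips. All names and sizes (`ista`, `N`, `M`, `K`, `lam`) are hypothetical.

```python
import numpy as np


def soft_threshold(v, t):
    """Proximal operator of the L1 norm: shrink each entry toward zero by t."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)


def ista(A, y, lam=0.01, n_iter=2000):
    """Minimize 0.5 * ||A z - y||^2 + lam * ||z||_1 by iterative soft-thresholding."""
    lipschitz = np.linalg.norm(A, 2) ** 2  # largest eigenvalue of A^T A
    z = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ z - y)
        z = soft_threshold(z - grad / lipschitz, lam / lipschitz)
    return z


rng = np.random.default_rng(0)
N, M, K = 200, 60, 3  # assumed sizes: latent dim, representation dim, sparsity

# Assumed stand-in for the linear map h: a random Gaussian matrix.
A = rng.standard_normal((M, N)) / np.sqrt(M)

# K-sparse latent vector with magnitudes in [1, 2] and random signs.
support = rng.choice(N, size=K, replace=False)
z = np.zeros(N)
z[support] = rng.uniform(1.0, 2.0, size=K) * rng.choice([-1.0, 1.0], size=K)

y = A @ z          # superposed representation (M < N)
z_hat = ista(A, y)  # sparse decoding

# Indices of the K largest recovered coefficients.
recovered = np.argsort(np.abs(z_hat))[-K:]
relative_error = np.linalg.norm(z_hat - z) / np.linalg.norm(z)
```

Because `A` is known in this sketch, the latent coordinates are recovered in their original order; when the dictionary is learned, recovery can only be expected up to a permutation of the features.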

Paper Structure

This paper contains 18 sections, 1 theorem, 10 equations, 4 figures, 1 table.

Key Result

Theorem 1

Let Assumption 1 in [reizinger2024cross] hold, and suppose that a continuous encoder $f: \mathbb{R}^D \rightarrow \mathbb{R}^d$ and a linear classifier $W$ globally minimize the cross-entropy objective. Then, the composition $h = f \circ g$ is a linear map from $\mathbb{S}^{d-1}$ to $\mathbb{R}^d$.
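One consequence of linearity is additivity: if $h$ is linear, the representation of a combined input equals the sum of the representations of its parts, so their cosine similarity is 1. The sketch below checks this on synthetic data. It is only an illustration under stated assumptions: `W` is a random matrix standing in for $h$, the concepts are one-hot vectors, the sphere constraint of the theorem is ignored, and a saturating `tanh` map is included purely as a contrast. None of these names or sizes come from the paper.

```python
import numpy as np


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


rng = np.random.default_rng(0)
N, M = 100, 64  # assumed latent and representation dimensions

# Assumed stand-in for h: a random linear map (scaled so tanh saturates below).
W = 10.0 * rng.standard_normal((M, N))

# Two hypothetical concepts "A" and "B" as one-hot latent vectors.
z_a = np.zeros(N)
z_a[3] = 1.0
z_b = np.zeros(N)
z_b[7] = 1.0


def h_linear(z):
    return W @ z


def h_nonlinear(z):
    # Contrast case: a map that is not linear in z.
    return np.tanh(W @ z)


# Additivity test: sum of the parts versus representation of the combination.
cs_linear = cosine_similarity(h_linear(z_a) + h_linear(z_b), h_linear(z_a + z_b))
cs_nonlinear = cosine_similarity(
    h_nonlinear(z_a) + h_nonlinear(z_b), h_nonlinear(z_a + z_b)
)

# Baseline: similarity between the two individual representations.
cs_baseline = cosine_similarity(h_linear(z_a), h_linear(z_b))
```

For the linear map, `cs_linear` equals 1 up to floating-point error, while the nonlinear contrast falls clearly below it. The baseline plays the same role as the calibrated baseline in the Figure 1 caption: it shows that a high similarity is not explained by the two parts simply resembling each other.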

Figures (4)

  • Figure 1: Superposition of neural representations. A) Images generated from GPT-4 prompts: "A": an elephant, "B": a pink ball, and "A and B": an elephant and a pink ball. B) Test for additivity: the sum of the neural representations (ViT-B/16 [dosovitskiy_image_2021]) of the first two images ($f \circ g(\text{"A"})$ and $f \circ g(\text{"B"})$) is compared with the neural representation of the combined image ($f \circ g(\text{"A and B"})$) (c.s.: cosine similarity; each dot corresponds to one dimension of the neural representation). C) Real (not AI-generated) images of a dog ("Hazel"), a cat ("Nemo"), and both together. D) Same analysis as in B), but with the images in C), showing that additivity holds even on natural images. Further examples ($N=10$, not shown) support that this finding is statistically significant (over a calibrated baseline, i.e., the c.s. of the representations of just "A" and "B") in neural representation space ($f \circ g$, $p < 10^{-5}$) but not in pixel space ($g$, $p \approx 0.88$).
  • Figure 2: Theory and analysis pipeline. Data ($x$) arise from interpretable features ($z \in \mathbb{R}^N$) through a nonlinear function ($g$), and neural representations ($y \in \mathbb{R}^M$) arise from data through another nonlinear function ($f$). However, because neural representations have lower dimensionality ($M < N$), they overlay interpretable features in superposition [elhage_toy_2022]. 1) Identifiability theory establishes that the overall mapping from interpretable features to neural representations must be linear. 2) Compressed sensing theory shows that sparse coding can lift sparse features out of superposition, recovering the original interpretable features up to permutations. 3) Since the true interpretable features are unknown, we evaluate the success of sparse coding using a permutation-invariant measure of interpretability as a proxy.
  • Figure 3: Mathematical world model and neural analogy making. A) The latent variables (i.e., features, see text) $z$ are nonlinearly mapped to the data $x=g(z)$. The data are nonlinearly mapped to neural representations $y=f(x)$. Under the assumption that $h=f \circ g$ is invertible (and the requirement that all $z$ are important for the task that $f$ is trained to solve), the key insight from identifiability theory is that $h(z)=y$ will be linear (Theorem 1). The neural representation is likely a lower-dimensional representation of the potentially high-dimensional latent variables, i.e., $h: \mathbb{R}^N \rightarrow \mathbb{R}^M$ with $M \ll N$. Under additional assumptions (e.g., $z$ being sparse) [donoho2006short], the key insight from compressed sensing theory is that we can perform sparse coding to decode sparse codes from the neural representation $y \rightarrow \hat{z}$, allowing us to uniquely identify the true latent variables up to permutations $z \sim_P \hat{z}$. B) A generative model (GPT-4) that maps latent variables, here represented as text, to images $g(z)=x$. C) A neural representation (ViT-B/16 [dosovitskiy_image_2021]) of the generated images $f(x)$, demonstrating analogy making. This shows how concepts combine linearly, at least for this local example, in neural representation space (c.s.: cosine similarity).
  • Figure 4: Compressed sensing bounds. For various embedding sizes ($M$), the plots show lower bounds for signal reconstruction with high probability (omitting any constant scaling factors) [donoho2006short]. The boundary between invertible and non-invertible regions is defined by $M = K \log(N/K)$. Shown are embeddings for hidden dimension sizes of models such as word2vec [mikolov_efficient_2013], CLIP [radford_learning_2021], and LLaMA v1 with 6.7B parameters [touvron_llama_2023].
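The boundary $M = K \log(N/K)$ from the Figure 4 caption can be turned into a rough calculator: given an embedding size $M$ and a number of latent features $N$, find the largest sparsity $K$ that still satisfies $K \log(N/K) \le M$. The sketch below does this with a simple scan. Like the caption, it omits constant scaling factors, so the numbers indicate scaling behavior rather than guarantees. The function name `max_sparsity`, the embedding sizes, and the value of `N` are illustrative assumptions, not values reported in the paper.

```python
import math


def max_sparsity(M: int, N: int) -> int:
    """Largest integer K with K * log(N / K) <= M (0 if there is none).

    K * log(N / K) is increasing in K for K < N / e, so the scan stops
    at the first K that exceeds the budget M.
    """
    best = 0
    for K in range(1, int(N / math.e) + 1):
        if K * math.log(N / K) <= M:
            best = K
        else:
            break
    return best


# Hypothetical number of latent features and hypothetical embedding sizes.
N_FEATURES = 10**6
for M in (300, 512, 4096):
    K = max_sparsity(M, N_FEATURES)
    print(f"M={M:5d}  N={N_FEATURES}  max K={K}")
```

Under these assumptions the admissible sparsity grows with the embedding size, which matches the qualitative picture in the caption: larger representations can hold more simultaneously active features in superposition.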

Theorems & Definitions (3)

  • Definition 1: Smolensky superposition
  • Theorem 1: Supervised Learning Identifiability [reizinger2024cross]
  • Definition 2: Superposition