Superposition disentanglement of neural representations reveals hidden alignment

André Longon, David Klindt, Meenakshi Khosla

TL;DR

This work investigates how neural superposition, in which single units encode multiple features, affects representational alignment between networks and brains. It develops a theoretical framework showing that permutation-based alignment scores deflate under superposition, with the deflation tied to sparsity patterns and feature mixing; perfect alignment is achievable when sparse recovery is exact under restricted isometry property (RIP) conditions. Using toy two-layer autoencoders, large-scale DNN experiments (ResNet50 and ViT-B/16 trained on ImageNet), and DNN→brain mappings on the Natural Scenes Dataset (NSD), the authors demonstrate that disentangling features with TopK SAEs consistently increases alignment scores when latent features are compared across seeds or modalities, particularly in deeper layers. These results imply that superposition can mask shared computational structure and that disentanglement is a crucial step for accurate cross-model and brain alignment, potentially reshaping NeuroAI methodologies and interpretations of representational similarity.

Abstract

The superposition hypothesis states that single neurons may participate in representing multiple features in order for the neural network to represent more features than it has neurons. In neuroscience and AI, representational alignment metrics measure the extent to which different deep neural networks (DNNs) or brains represent similar information. In this work, we explore a critical question: does superposition interact with alignment metrics in any undesirable way? We hypothesize that models which represent the same features in different superposition arrangements, i.e., their neurons have different linear combinations of the features, will interfere with predictive mapping metrics (semi-matching, soft-matching, linear regression), producing lower alignment than expected. We develop a theory for how permutation metrics are dependent on superposition arrangements. This is tested by training sparse autoencoders (SAEs) to disentangle superposition in toy models, where alignment scores are shown to typically increase when a model's base neurons are replaced with its sparse overcomplete latent codes. We find similar increases for DNN-DNN and DNN-brain linear regression alignment in the visual domain. Our results suggest that superposition disentanglement is necessary for mapping metrics to uncover the true representational alignment between neural networks.
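The hypothesis in the abstract can be illustrated with a small numerical sketch. The setup below is an assumption made for illustration and is not the authors' experimental code: two hypothetical "models" read out the same sparse features through different random linear mixing matrices (`W_a`, `W_b`), and a semi-matching-style score (mean of each unit's best Pearson correlation with any unit in the other model) is computed once on the mixed neurons and once on ideally disentangled latents, which here are simply the features themselves up to a permutation.

```python
import numpy as np

def semi_matching_score(X, Y):
    """Mean over units of X of the best Pearson correlation with any unit of Y."""
    Xz = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-8)
    Yz = (Y - Y.mean(axis=0)) / (Y.std(axis=0) + 1e-8)
    corr = Xz.T @ Yz / X.shape[0]          # (units_X, units_Y) correlation matrix
    return corr.max(axis=1).mean()

rng = np.random.default_rng(0)
n_samples, n_features, n_neurons = 5000, 32, 16   # more features than neurons

# Sparse features: each one is active on roughly 10% of samples (assumed value).
active = rng.random((n_samples, n_features)) < 0.1
feats = active * rng.random((n_samples, n_features))

# Two models encode the SAME features in DIFFERENT superposition arrangements.
W_a = rng.standard_normal((n_features, n_neurons))
W_b = rng.standard_normal((n_features, n_neurons))
neurons_a = feats @ W_a
neurons_b = feats @ W_b

# Idealized disentanglement: each latent is one feature, in a model-specific order.
latents_a = feats
latents_b = feats[:, rng.permutation(n_features)]

score_neurons = semi_matching_score(neurons_a, neurons_b)
score_latents = semi_matching_score(latents_a, latents_b)
print(f"neurons: {score_neurons:.3f}  latents: {score_latents:.3f}")
```

In this idealized case the latent-level score is essentially 1 while the neuron-level score is much lower, even though both models carry identical feature information. A trained SAE would only approximate the ideal latents, so real gains would be expected to be smaller.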

Paper Structure

This paper contains 37 sections, 14 equations, 7 figures.

Figures (7)

  • Figure 1: A visual depiction of the question: does superposition disentanglement increase representational alignment? Features are represented as colors, and neurons may arrange multiple features in superposition. The left side of the inequality shows an alignment metric taken between the base neurons of two models which represent the same features but in different superposition arrangements. The right side shows each model with their superimposed representations disentangled via a projection into an ideal sparse overcomplete space, where each dimension represents an individual feature. The same alignment metric is taken over these sparse latent codes in lieu of the base neurons.
  • Figure 2: Feature overlap (left, top row) and the superposition arrangement comparison (left, bottom row) of the shared features (multiplied norms $\geq 1$) between the differently seeded toy models ($N=8, 16, 32$ models are displayed along columns). Soft-matching alignment (right) increases significantly when base neurons are replaced with SAE latents for $N=16$ and $N=32$.
  • Figure 3: Soft-matching alignment between differently seeded DNNs (left: ResNet50, right: ViT-B/16) trained on ImageNet object classification.
  • Figure 4: Regression correlation scores between the same toy models (top) and DNNs (bottom).
  • Figure 5: Semi-matching correlations between features and toy model neurons (cyan) and SAE latents (magenta). Vertical dashed lines are the respective means across features; the SAE latents have higher mean correlations than the neurons across all models.
  • ...and 2 more figures
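
The SAE latents referred to in the captions come from TopK sparse autoencoders, as noted in the TL;DR. For readers unfamiliar with that architecture, a minimal forward-pass sketch follows. It uses the common TopK formulation (linear encoder, keep the k largest pre-activations per sample, linear decoder); the function name, the dimensions, the random untrained weights, and the tied decoder are all assumptions for illustration, and the paper's actual training setup is not reproduced here.

```python
import numpy as np

def topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k):
    """Hypothetical TopK SAE forward pass: returns (sparse latents, reconstruction)."""
    pre = (x - b_dec) @ W_enc + b_enc                 # encoder pre-activations
    # Column indices of the k largest pre-activations in each row.
    idx = np.argpartition(pre, -k, axis=1)[:, -k:]
    rows = np.arange(pre.shape[0])[:, None]
    latents = np.zeros_like(pre)
    # Keep only the top-k entries (clipped at zero); everything else stays 0.
    latents[rows, idx] = np.maximum(pre[rows, idx], 0.0)
    recon = latents @ W_dec + b_dec                   # linear decoder
    return latents, recon

rng = np.random.default_rng(0)
n_neurons, n_latents, k = 16, 64, 4                   # overcomplete: 64 latents for 16 neurons
x = rng.standard_normal((8, n_neurons))               # stand-in for base neuron activations
W_enc = rng.standard_normal((n_neurons, n_latents)) / np.sqrt(n_neurons)
W_dec = W_enc.T.copy()                                # tied weights, purely for the sketch
b_enc = np.zeros(n_latents)
b_dec = np.zeros(n_neurons)

latents, recon = topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k)
print(latents.shape, np.count_nonzero(latents, axis=1))
```

Alignment metrics would then be computed on `latents` in place of the base activations `x`; with untrained weights, as here, the reconstruction is not meaningful and only the sparsity structure is illustrated.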