Measuring the Representational Alignment of Neural Systems in Superposition

Sunny Liu, Habon Issa, André Longon, Liv Gorton, Meenakshi Khosla, David Klindt

Abstract

Comparing the internal representations of neural networks is a central goal in both neuroscience and machine learning. Standard alignment metrics operate on raw neural activations, implicitly assuming that similar representations produce similar activity patterns. However, neural systems frequently operate in superposition, encoding more features than they have neurons via linear compression. We derive closed-form expressions showing that superposition systematically deflates Representational Similarity Analysis, Centered Kernel Alignment, and linear regression, causing networks with identical feature content to appear dissimilar. The root cause is that these metrics depend on the cross-similarity between the two systems' superposition matrices, which typically differ substantially under random projections, rather than on the latent features themselves: alignment scores conflate what a system represents with how it represents it. Under partial feature overlap, this confound can invert the expected ordering, making systems that share fewer features appear more aligned than systems that share more. Crucially, the apparent misalignment need not reflect a loss of information; compressed sensing guarantees that the original features remain recoverable from the lower-dimensional activity, provided they are sparse. We therefore argue that comparing neural systems in superposition requires extracting and aligning the underlying features rather than comparing the raw neural mixtures.
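To make the abstract's claim concrete, here is a minimal, hedged sketch (our own construction, not the paper's code; all names, shapes, and settings are illustrative assumptions): two systems share identical $k$-sparse latent features but compress them through different random projection matrices, and linear CKA on the raw activations comes out well below the perfect alignment obtained on the shared latents.

```python
import numpy as np

rng = np.random.default_rng(0)

def linear_cka(X, Y):
    """Linear CKA between two (samples x units) representations."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    return (np.linalg.norm(X.T @ Y, "fro") ** 2
            / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")))

n_samples, n_features, n_neurons, k = 2000, 512, 64, 8

# k-sparse latent features, shared by both systems (Z_a = Z_b).
Z = np.zeros((n_samples, n_features))
for row in Z:
    row[rng.choice(n_features, size=k, replace=False)] = rng.standard_normal(k)

# Different random superposition (compression) matrices for each system.
A_a = rng.standard_normal((n_features, n_neurons)) / np.sqrt(n_neurons)
A_b = rng.standard_normal((n_features, n_neurons)) / np.sqrt(n_neurons)

Y_a, Y_b = Z @ A_a, Z @ A_b  # raw neural activations

print(f"CKA on raw activations: {linear_cka(Y_a, Y_b):.3f}")  # well below 1
print(f"CKA on shared latents:  {linear_cka(Z, Z):.3f}")      # exactly 1.0
```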

Paper Structure

This paper contains 37 sections, 3 theorems, 47 equations, and 4 figures.

Key Result

Theorem 3.1

The RSA correlation between two representations $Y_a$ and $Y_b$ in superposition is the cosine similarity between their respective Gram matrices, $G_a = A_a^{\mathsf{T}} A_a$ and $G_b = A_b^{\mathsf{T}} A_b$:

$$\mathrm{RSA}(Y_a, Y_b) = \frac{\langle G_a, G_b \rangle_F}{\|G_a\|_F \, \|G_b\|_F},$$

where $\langle \cdot, \cdot \rangle_F$ and $\|\cdot\|_F$ are the Frobenius inner product and norm, respectively.
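As a sanity check on this formula, the following hedged sketch (our construction, not the paper's code; the shapes, the dot-product similarity matrices, and the dense Gaussian latents are simplifying assumptions) compares an empirical RSA score against the cosine similarity of the Gram matrices:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, n_samples = 256, 64, 2000  # n latent features, m neurons (m < n)

def gram_cosine(A_a, A_b):
    """Theorem 3.1's prediction: cosine of Gram matrices in the Frobenius inner product."""
    G_a, G_b = A_a.T @ A_a, A_b.T @ A_b
    return np.sum(G_a * G_b) / (np.linalg.norm(G_a) * np.linalg.norm(G_b))

def rsa(Y_a, Y_b):
    """RSA: correlate off-diagonal entries of the sample-by-sample similarity matrices."""
    S_a, S_b = Y_a @ Y_a.T, Y_b @ Y_b.T
    iu = np.triu_indices_from(S_a, k=1)
    return np.corrcoef(S_a[iu], S_b[iu])[0, 1]

A_a = rng.standard_normal((m, n)) / np.sqrt(m)  # superposition matrices
A_b = rng.standard_normal((m, n)) / np.sqrt(m)
Z = rng.standard_normal((n_samples, n))         # shared latents (dense here, for simplicity)

Y_a, Y_b = Z @ A_a.T, Z @ A_b.T                 # neural activations, Y = A Z per sample

print(f"empirical RSA:          {rsa(Y_a, Y_b):.3f}")
print(f"Gram-cosine prediction: {gram_cosine(A_a, A_b):.3f}")  # the two should roughly agree
```

For zero-mean Gaussian latents with dot-product similarity matrices, the expected product of a pair's two similarity entries is $\langle G_a, G_b \rangle_F$, which is why the empirical score converges to the Gram cosine as the number of samples grows.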

Figures (4)

  • Figure 1: Illustration of core idea. (Left) Superposition: Two neural networks share an identical set of latent features ($Z_a = Z_b$), but compress them (red arrows) via different projection matrices, yielding distinct neural activations $Y_a \neq Y_b$. Computing alignment over these raw activations leads to artificially low representational similarity. (Middle) Linear regression: Assuming perfect latent recovery, the maximum pairwise correlation between latent activations is $1.0$, and will be greater than the correlation between raw neural activations. (Right) Representational similarity analysis: RSA first computes pairwise (dis)similarity matrices of neural responses, then correlates these matrices to produce an alignment score. As with linear regression, the RSA score for perfectly recovered latents is $1.0$, and greater than the RSA score over neural activations.
  • Figure 2: Neural Alignment Decreases with Superposition. Alignments measured with RSA (Top Left), Linear Regression $R^2$ (Top Right), and CKA with Linear Kernel (Bottom Left), as a function of system dimension ($m$ in units of $k \ln \frac{n}{k}$). This experiment is repeated across multiple sparsity levels ($k$). Analytical predictions are shown as solid curves; empirical simulation results across different superposition compressions are shown as dots. We note where accurate latent recovery from compressed representations is (CS; green shading) or is not (No CS; red shading) possible (Donoho, 2006). A minimal simulation in this spirit is sketched after this list.
  • Figure 3: Impact of Feature Overlap on Neural Alignment under Superposition. Alignment measured with CKA with Linear Kernel as a function of the overlap ratio. This experiment is repeated across multiple levels of system dimension $m$, here given in units of $k \ln \frac{l}{k}$. Higher system dimension indicates less superposition. Analytical predictions are shown as solid curves; empirical simulation results across different superposition compressions are shown as dots.
  • Figure 4: Superposition obscures alignment under partial feature overlap. Alignment measured with CKA with Linear Kernel as a function of System b dimension across multiple feature overlap ratios ($u$). This experiment is repeated over several sparsity fractions $k/l$. Analytical predictions are shown as solid curves; empirical simulation results across different superposition compressions are shown as dots. A red dashed line denotes the minimum alignment under perfect feature sharing (i.e., $u=1$). Above the red line lies the region where systems with partial feature sharing (i.e., $u<1$) can attain higher alignment than systems with perfect feature sharing.
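In the spirit of Figure 2, here is a rough self-contained simulation (under our own simplifying assumptions: i.i.d. Gaussian projection matrices, $k$-sparse Gaussian latents, and a direct sweep over the neural dimension $m$ rather than units of $k \ln \frac{n}{k}$). Alignment on raw activations rises toward 1 as $m$ grows and superposition weakens:

```python
import numpy as np

rng = np.random.default_rng(2)
n_samples, n_features, k = 2000, 256, 8  # latent dimension n and sparsity k

def linear_cka(X, Y):
    """Linear CKA between two (samples x units) representations."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    return (np.linalg.norm(X.T @ Y, "fro") ** 2
            / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")))

# Shared k-sparse latents (full feature overlap, u = 1).
Z = np.zeros((n_samples, n_features))
for row in Z:
    row[rng.choice(n_features, size=k, replace=False)] = rng.standard_normal(k)

# Sweep the neural dimension m: more neurons means weaker superposition.
for m in (16, 32, 64, 128, 256, 512, 1024):
    A_a = rng.standard_normal((n_features, m)) / np.sqrt(m)
    A_b = rng.standard_normal((n_features, m)) / np.sqrt(m)
    print(f"m={m:5d}  CKA={linear_cka(Z @ A_a, Z @ A_b):.3f}")  # increases with m
```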

Theorems & Definitions (4)

  • Definition 2.1: Superposition
  • Theorem 3.1: Asymptotic RSA Alignment
  • Theorem 3.2: Asymptotic Linear CKA Alignment
  • Theorem 3.3: Asymptotic Linear Regression