Neural population geometry and optimal coding of tasks with shared latent structure

Albert J. Wakhloo; Will Slatton; SueYeon Chung

TL;DR

The paper introduces a Gaussian-equivalence, geometry-driven theory that links population-level neural activity statistics to the generalization performance of linear readouts across tasks that share a latent structure. It identifies four geometric terms that fully determine cross-task performance: $c$ (neural-latent correlation), $f$ (signal-signal factorization), $s$ (signal-noise factorization), and $PR(\Psi)$ (neural dimensionality). It also predicts that disentangled representations are optimal. The authors show that optimal codes compress less informative latent variables when data are scarce and expand them as data become abundant, so that the eigenspectrum of the optimal code flattens as the number of training samples grows. They validate the theory on synthetic MLPs and macaque V4/IT data, showing accurate predictions of readout generalization and revealing distinct geometric signatures along the ventral stream. The work provides a principled link between neural population geometry and multi-task learning, and it offers testable predictions for neural coding and learning dynamics in both artificial and biological systems.
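
As a rough illustration of the dimensionality term, the sketch below computes a participation ratio from a covariance matrix. It assumes that $\Psi$ denotes a covariance matrix of the neural activity and that $PR$ is the standard participation ratio, $(\mathrm{tr}\,\Psi)^2 / \mathrm{tr}(\Psi^2)$; the paper's exact definition may differ. The function names `participation_ratio` and `power_law_cov` are hypothetical.

```python
import numpy as np

def participation_ratio(cov):
    """Standard participation ratio (tr C)^2 / tr(C^2) of a covariance matrix.

    Assumption: this is the conventional definition, not necessarily the
    exact quantity used in the paper.
    """
    cov = np.asarray(cov, dtype=float)
    return float(np.trace(cov) ** 2 / np.trace(cov @ cov))

def power_law_cov(d, alpha):
    """Diagonal covariance with eigenvalues k^(-alpha), k = 1..d (illustrative)."""
    return np.diag(np.arange(1, d + 1, dtype=float) ** (-alpha))

# A flat spectrum (alpha = 0) spreads variance over all d directions,
# while a steep spectrum concentrates it in a few directions.
pr_flat = participation_ratio(power_law_cov(100, 0.0))
pr_steep = participation_ratio(power_law_cov(100, 2.0))
print(pr_flat, pr_steep)
```

Under this definition a flatter eigenspectrum yields a larger participation ratio, which is one way to quantify the flattening described above.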

Abstract

Humans and animals can recognize latent structures in their environment and apply this information to efficiently navigate the world. However, it remains unclear what aspects of neural activity contribute to these computational capabilities. Here, we develop an analytical theory linking the geometry of a neural population's activity to the generalization performance of a linear readout on a set of tasks that depend on a common latent structure. We show that four geometric measures of the activity determine performance across tasks. Using this theory, we find that experimentally observed disentangled representations naturally emerge as an optimal solution to the multi-task learning problem. When data is scarce, these optimal neural codes compress less informative latent variables, and when data is abundant, they expand these variables in the state space. We validate our theory using macaque ventral stream recordings. Our results therefore tie population geometry to multi-task learning.

Paper Structure (15 sections, 6 equations, 7 figures)

Abstract
Introduction
Results
Theory of multi-task learning
Optimal representation of latent variables
Geometry of multi-task learning in MLPs
Predicting readout performance of macaque visual representations
Discussion
Methods
Model of multi-task learning
Geometric decomposition of generalization error across tasks
Gaussian simulations
Optimal codes
MLP Experiments
Macaque analyses

Figures (7)

Figure 1: Schematic of the task and model setup using images from the d-sprites dataset as an example [dsprites17]. (a) Visual stimuli are generated from points in a latent space, and binary discrimination tasks are formed by linearly separating the latent space using a hyperplane with normal $T_1$. (b) Each stimulus elicits a neuronal activity pattern, visualized as points in an activity space. (c) A new binary discrimination task can be formed by separating the latent space using a different hyperplane with normal $T_2$. (d) As before, stimuli elicit neural activity patterns. (e-f) Schematic of the Hebbian learning rule we consider. Given a set of $p=12$ training stimuli, subsequent decisions are made using a linear readout of the neural activity patterns, in this case the firing rates of three neurons. When the number of positive (red) and negative (blue) labels is balanced, this linear readout corresponds to using a hyperplane whose normal points in the direction of the difference in the means of positive and negative examples.
Figure 2: Schematic of the geometric terms. We visualize different possible neuronal activity patterns elicited by the same set of stimuli. (a) A small slice of the latent space from which stimuli are generated. (b) Visualization of neural activity patterns with low (left) and high (right) total correlation. When the correlation is high, the relative distances between points in the latent space are approximately preserved in the neural state space. (c) Signal-signal factorization (SSF). When the SSF is low, different latent variables are represented along overlapping directions, and when it is high, independent directions in the latent space are represented along approximately orthogonal directions in the neural state space. (d) Signal-noise factorization (SNF). When the SNF is low, the noise distribution (grey ellipses) around a point in the firing rate space falls along the coding directions. When it is high, the noise distribution is orthogonal to these directions. (e) Neural dimension. In higher dimensional representations, the neural activity and associated noise distribution occupy more directions in the state space, shown here as 2D (left) vs. 3D (right) noise distributions. As the dimension increases, the projection of a sample of neural activity onto a given direction becomes increasingly concentrated, supporting generalization performance [sorscher2022neural].
Figure 3: Theory predicts empirical generalization error in a Gaussian model with power law covariance spectra. (a) Schematic illustrating simulation setup. Gaussian latent variables $z$ are used to generate task labels, and predictions are formed using a linear transformation of the latent variables, $x=Az$. (b-d) Two typical units for the (b) latent variables, (c) random high dimensional projection, and (d) whitened transform for various values of the spectral decay exponent, $\alpha$. (e) Eigenvalues of the latent covariance, $\Omega$, for different decay rates, $\alpha$. (f-h) Multi-task generalization error as a function of training samples, $p$, for (f) the latent variables themselves, (g) the random projection, and (h) the whitened transform.
Figure 4: Optimal representational geometry as a function of training samples and latent structure. (a) In our task setup, directions in the latent space that have little variance are, on average, less informative of the task labels (Methods). Optimal neuronal representations of these latent variables map independent directions in the latent space to independent directions in the neuronal space. The amount of variance corresponding to less informative directions in the latent space is small when data is scarce and is large when data is abundant. (b) Eigenvalues of the optimal neural covariance as a function of the ratio of the number of samples to the latent dimension, $p/d$. We show the eigenvalues of the latent variables' covariance in black. Markers correspond to results obtained by optimizing our formula for the generalization error numerically, and solid lines correspond to our formula for the optimal code's spectrum (Methods). As the number of samples increases, the spectrum of the optimal neural code becomes increasingly flat, indicating the expansion strategy described above. (c-e) Geometric terms of the optimal neural representation for various values of $p$, given the same latent covariance. (Note that we do not plot the signal-noise factorization, as it diverges for the optimal representation for all $p$.) Surprisingly, the total correlation decreases with $p$.
Figure 5: Theory predicts generalization error of the Hebb rule in trained and random MLPs. (a) Schematic of the experiment. Latent variables $z$ are randomly shattered to generate task labels. These latents are passed through a random MLP (light blue) and are then used as inputs to train a three-hidden-layer MLP (dark blue) on the multi-task binary classification problem using stochastic gradient descent. (b-c) After training, we sample a new set of latents and teacher vectors and calculate the generalization error of the Hebb rule on each layer of the (b) random and (c) trained network. Theoretical predictions closely track empirical errors, and the trained network achieves a lower error in intermediate layers. (d-g) Geometric terms across layers for the random (light blue line) and trained (dark blue line) networks. Linear layers are marked by circles and ReLU layers by squares. Interestingly, the error changes only slightly across linear and ReLU layers of the same model stage, in spite of sharp changes in the geometry. (h-k) Average change in the geometry before and after the application of ReLU. Here we show the mean and standard deviation of the difference between each geometric term before and after the ReLU nonlinearity is applied. In the trained network, the application of ReLU consistently causes increases in the dimension and signal-signal factorization, as well as decreases in the correlation.
...and 2 more figures
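
The setup described in the captions of Figure 1 (e-f) and Figure 3 can be sketched numerically. The following is a minimal, noise-free illustration, not the authors' code. It assumes latent eigenvalues of the form $k^{-\alpha}$, a random Gaussian projection $A$, Gaussian task normals $T$, labels given by the sign of $T \cdot z$, and a Hebbian weight vector equal to the label-weighted mean of the training activity (which is proportional to the difference of class means when labels are balanced). The names `power_law_spectrum` and `hebbian_error`, and all default parameter values, are hypothetical.

```python
import numpy as np

def power_law_spectrum(d, alpha):
    """Eigenvalues k^(-alpha), k = 1..d (assumed form of the power-law spectrum)."""
    return np.arange(1, d + 1, dtype=float) ** (-alpha)

def hebbian_error(p, d=50, n=200, alpha=1.0, n_test=2000, n_tasks=50, seed=0):
    """Average test error of a Hebbian readout over randomly drawn tasks.

    p: number of training samples per task; d: latent dimension;
    n: number of model "neurons". All settings are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    scale = np.sqrt(power_law_spectrum(d, alpha))
    # Random projection standing in for the neural code, x = A z.
    A = rng.standard_normal((n, d)) / np.sqrt(d)
    errors = []
    for _ in range(n_tasks):
        T = rng.standard_normal(d)  # task normal in latent space
        z_train = rng.standard_normal((p, d)) * scale
        z_test = rng.standard_normal((n_test, d)) * scale
        y_train = np.sign(z_train @ T)
        y_test = np.sign(z_test @ T)
        x_train = z_train @ A.T
        x_test = z_test @ A.T
        # Hebbian readout: label-weighted average of training activity.
        w = (y_train[:, None] * x_train).mean(axis=0)
        pred = np.sign(x_test @ w)
        errors.append(np.mean(pred != y_test))
    return float(np.mean(errors))

for p in (2, 20, 200):
    print(p, hebbian_error(p))
```

Sweeping `p` in this way gives an empirical error curve of the kind that Figure 3 (f-h) compares against the theory; swapping `A` for an identity or a whitening matrix would correspond to the other two representations mentioned in that caption.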
