Mutual information and task-relevant latent dimensionality

Paarth Gulati; Eslam Abdelaleem; Audrey Sederberg; Ilya Nemenman

TL;DR

This work tackles the challenge of identifying the task-relevant latent dimensionality required to predict outcomes from high-dimensional observations. It casts the problem as a symmetric information bottleneck and shows that conventional neural mutual information (MI) estimators with separable or bilinear critics inflate the inferred dimension. To address this, the authors introduce a hybrid critic that preserves an explicit bottleneck while allowing nonlinear cross-view interactions, which permits reliable one-shot estimation of the effective latent dimensionality via a cross-covariance participation ratio. The method remains robust to observation noise and extends to intrinsic dimensionality by view splitting; it is validated on synthetic benchmarks and on physics datasets such as the 2D Ising model and pendulum dynamics. Overall, the approach provides a practical, data-efficient tool for uncovering meaningful latent structure in noisy scientific data and offers new insight into the geometry of task-relevant representations.
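To make the idea of a hybrid critic concrete, here is a minimal NumPy sketch, assuming the structure given in the Figure 1 caption: two encoders map each view to a $k_z$-dimensional bottleneck, and a small MLP head $T_\theta$ scores the concatenated embeddings. The layer widths, tanh activations, random (untrained) weights, and the InfoNCE-style bound are all assumptions made for illustration; this page does not specify which variational objective or architecture details the authors use.

```python
import numpy as np

def init_mlp(rng, sizes):
    """Random (W, b) pairs for an MLP with the given layer sizes (illustrative init)."""
    return [(rng.normal(0.0, 1.0 / np.sqrt(m), (m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def mlp(params, h):
    """tanh hidden layers followed by a linear output layer."""
    for W, b in params[:-1]:
        h = np.tanh(h @ W + b)
    W, b = params[-1]
    return h @ W + b

def hybrid_scores(enc_x, enc_y, head, X, Y):
    """scores[i, j] = head([enc_x(x_i), enc_y(y_j)]) for every pair in the batch."""
    zx, zy = mlp(enc_x, X), mlp(enc_y, Y)          # each (N, k_z): the bottleneck
    N = X.shape[0]
    pairs = np.concatenate([np.repeat(zx, N, axis=0),   # x_i repeated N times
                            np.tile(zy, (N, 1))],        # y_0..y_{N-1} cycled
                           axis=1)                       # (N*N, 2*k_z)
    return mlp(head, pairs).reshape(N, N)

def infonce_bits(scores):
    """InfoNCE-style lower bound in bits (assumed objective, bounded by log2 N)."""
    N = scores.shape[0]
    m = scores.max(axis=1, keepdims=True)
    lse = m[:, 0] + np.log(np.exp(scores - m).sum(axis=1))
    return float((np.diag(scores) - lse + np.log(N)).mean() / np.log(2))

rng = np.random.default_rng(0)
N, d_x, d_y, k_z = 16, 20, 30, 3                  # hypothetical sizes
X, Y = rng.normal(size=(N, d_x)), rng.normal(size=(N, d_y))
enc_x = init_mlp(rng, [d_x, 32, k_z])
enc_y = init_mlp(rng, [d_y, 32, k_z])
head = init_mlp(rng, [2 * k_z, 32, 1])
scores = hybrid_scores(enc_x, enc_y, head, X, Y)
bound_bits = infonce_bits(scores)
```

By contrast, a separable critic would score a pair with an inner product of the two embeddings, so any nonlinear interaction between views has to be absorbed by extra embedding dimensions; the concatenated head removes that pressure while keeping $k_z$ explicit.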

Abstract

Estimating the dimensionality of the latent representation needed for prediction -- the task-relevant dimension -- is a difficult, largely unsolved problem with broad scientific applications. We cast it as an Information Bottleneck question: what embedding bottleneck dimension is sufficient to compress predictor and predicted views while preserving their mutual information (MI)? This repurposes neural MI estimators for dimensionality estimation. We show that standard neural estimators with separable/bilinear critics systematically inflate the inferred dimension, and we address this by introducing a hybrid critic that retains an explicit dimensional bottleneck while allowing flexible nonlinear cross-view interactions, thereby preserving the latent geometry. We further propose a one-shot protocol that reads off the effective dimension from a single over-parameterized hybrid model, without sweeping over bottleneck sizes. We validate the approach on synthetic problems with known task-relevant dimension. We extend the approach to intrinsic dimensionality by constructing paired views of a single dataset, enabling comparison with classical geometric dimension estimators. In noisy regimes where those estimators degrade, our approach remains reliable. Finally, we demonstrate the utility of the method on multiple physics datasets.
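According to the Figure 4 caption, the one-shot read-out uses the participation ratio of the cross-covariance spectrum of the learned embeddings. The sketch below assumes the common form $d_{\rm eff} = (\sum_i s_i)^2 / \sum_i s_i^2$ over the singular values $s_i$; the paper's own definition of $d_{\rm eff}$ is not reproduced on this page and may differ (for example, by using squared singular values). The function names and the synthetic "embeddings" are illustrative stand-ins for the outputs of trained encoders.

```python
import numpy as np

def cross_covariance_spectrum(zx, zy):
    """Singular values of the sample cross-covariance between two embedding sets."""
    zx = zx - zx.mean(axis=0)
    zy = zy - zy.mean(axis=0)
    C = zx.T @ zy / (zx.shape[0] - 1)              # (k_z, k_z) cross-covariance
    return np.linalg.svd(C, compute_uv=False)

def participation_ratio(s):
    """Assumed form: (sum s_i)^2 / sum s_i^2, which equals K for K equal modes."""
    s = np.asarray(s, dtype=float)
    return float(s.sum() ** 2 / (s ** 2).sum())

# Toy stand-in for learned embeddings: K_Z shared directions plus
# independent filler dimensions in each view (hypothetical sizes).
rng = np.random.default_rng(0)
N, K_Z, k_z = 10_000, 4, 8
shared = rng.normal(size=(N, K_Z))
zx = np.hstack([shared, rng.normal(size=(N, k_z - K_Z))])
zy = np.hstack([shared, rng.normal(size=(N, k_z - K_Z))])
d_eff = participation_ratio(cross_covariance_spectrum(zx, zy))
```

In this toy setting the spectrum has $K_Z$ modes near one and the rest near zero, so `d_eff` lands close to $K_Z$ even though $k_z$ is larger, which is the behaviour the one-shot protocol relies on.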

Paper Structure (49 sections, 35 equations, 17 figures)

Introduction
Contributions.
Setup and Methodology
Results
Infinite Data Regime: high dimensional input, low dimensional latents
Noise in the observation space and intrinsic dimensionality
Single-shot dimensionality estimation via participation ratio
Finite data
Estimating task-relevant dimensionality of physics datasets
Ising model
Pendulum dynamics
Discussion
Theoretical Framework
Variational objectives and critic architectures
Variational MI estimators
...and 34 more sections

Figures (17)

Figure 1: Hybrid Critic Architecture. Retains the bottleneck for dimensionality analysis, but allows flexible mixing via a concatenated head $T_\theta$ (e.g. a small MLP).
Figure 2: Role of embedding size in the infinite-data (resampling) regime. Estimated MI versus encoder embedding size $k_z$ for two latent distributions: (A--C) jointly Gaussian latent ($K_Z=4$) with total MI $I=2.0$ bits (equal per latent dimension); (D--F) Gaussian mixture with $N_p=8$ equally likely joint-Gaussian clusters with $K_Z=1$ (each with $\rho\approx 0.97$), with cluster means on a circle of radius $\mu=2.0$ (see App. \ref{app:details_synthetic_data}). (A,D) Latent distributions. (B,E) MI estimates (maximum over 10 trials, individual trials shown with semi-transparent markers) for frozen linear observation maps. (C,F) Same, but with frozen nonlinear teacher maps (see App. \ref{app:details_synthetic_data}). Vertical dotted lines mark the true task-relevant dimension, which is always matched by the $k_z^\ast$ chosen by the hybrid critic.
Figure 3: Effect of additive observation noise (hybrid critic). Independent white noise is added after the frozen nonlinear observation map, with $\langle \eta_{\alpha,i}\eta_{\beta,j}\rangle=\sigma_\alpha^2\delta_{ij}\delta_{\alpha\beta}$ and strength set by the noise-to-signal ratio $\eta$. (A) Joint Gaussian latent ($K_Z=4$). (B) Gaussian mixture ($N_p=8$ components). Noise reduces the estimated MI, while the saturation point used to infer dimensionality is preserved. MI is the maximum over 10 trials; individual trials shown as semi-transparent markers.
Figure 4: Single-shot dimensionality from the embedding spectrum (hybrid critic). The participation ratio of the cross-covariance spectrum of the learned encoder embeddings provides a reliable estimate of latent dimensionality. (A) Normalized singular values of the cross-covariance (computed from $10^4$ samples) for a trained model with $k_z=19$ (inset: log-scale). A clear gap appears after the first $K_Z$ modes, yielding $d_{\rm eff}$ via Eq. \ref{eq:deff_defintion}. (B) $d_{\rm eff}$ saturates for $k_z\gtrsim K_Z$ for both representative latent distributions, indicating that the learned embeddings concentrate onto an effectively $K_Z$-dimensional subspace. (C,D) $d_{\rm eff}$ from a single over-parameterized model ($k_z=64$) versus $K_Z$ for jointly Gaussian latents: varying total MI at fixed batch size $N_B=128$ (C), and varying $N_B$ at fixed total MI $I=2$ bits (D). As always, semi-transparent markers denote individual trials, and error bars are standard deviations.
Figure 5: Dimensionality estimation with finite data. (A) Max-test, train-estimate protocol \cite{abdelaleem2025accurate} for a jointly Gaussian latent ($K_Z=4$, $I=2$ bits) observed through a teacher map into $\mathbb{R}^{500}$ with $N=1024$ samples: select $t^\ast=\arg\max_t \widehat{I}_{\mathrm{test}}(t)$ and report $\widehat{I}_{\mathrm{train}}(t^\ast)$. Solid MI curves shown (and used to locate the maximum) are median-filtered over 20 epochs. (B) $d_{\rm eff}$ from the trained encoders ($k_z=64$) versus number of training samples for the two representative latent distributions (cross-covariance computed on the full training set). (C,D) $d_{\rm eff}$ from a single model with $k_z=64$ versus $K_Z$ for jointly Gaussian latents: varying total MI at fixed $N=4096$ (C), and varying $N$ at fixed $I=2$ bits (D). Usual notation for markers/error bars is used.
...and 12 more figures
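
The max-test, train-estimate rule stated in the Figure 5 caption (select $t^\ast=\arg\max_t \widehat{I}_{\mathrm{test}}(t)$, report $\widehat{I}_{\mathrm{train}}(t^\ast)$) can be sketched in a few lines. The caption says the curves are median-filtered over 20 epochs before locating the maximum; whether the window is trailing or centered, and whether the reported training value is the filtered or the raw one, is not stated here, so the trailing window and filtered read-out below are assumptions, as are the function names.

```python
import numpy as np

def median_filter(curve, window=20):
    """Running median over the trailing `window` epochs (shorter at the start)."""
    curve = np.asarray(curve, dtype=float)
    return np.array([np.median(curve[max(0, t - window + 1): t + 1])
                     for t in range(len(curve))])

def max_test_train_estimate(mi_train, mi_test, window=20):
    """Pick the epoch where the smoothed test MI peaks; report smoothed train MI there."""
    train_s = median_filter(mi_train, window)
    test_s = median_filter(mi_test, window)
    t_star = int(np.argmax(test_s))                # first epoch attaining the maximum
    return t_star, float(train_s[t_star])

# Tiny hand-checkable example with a window of 3 epochs.
t_star, mi_hat = max_test_train_estimate(
    mi_train=[0, 1, 2, 3, 4, 5],
    mi_test=[0, 1, 5, 1, 0, 0],
    window=3,
)
```

The point of reading the estimate off the training curve at the test-selected epoch is that the test curve signals when overfitting begins, while the training curve at that epoch has not yet been inflated by it.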
