Accurate Estimation of Mutual Information in High Dimensional Data

Eslam Abdelaleem, K. Michael Martini, Ilya Nemenman

TL;DR

This work tackles the challenge of accurately estimating the mutual information $I(X;Y)$ in high-dimensional, undersampled settings. It proposes a practical estimation protocol with explicit consistency checks, confidence intervals, and a generalized critic family including probabilistic VSIB variants, enabling reliable inference even when data exhibit complex nonlinear dependencies. The authors show that reliable MI estimation is achievable when the dependence is captured in a low-dimensional latent space, the critic is expressive enough, and the dataset sufficiently samples that latent structure, aided by max-test stopping and subsampling-extrapolation techniques. Across synthetic benchmarks and a real-world MNIST-based case, the methodology matches or surpasses existing estimators while providing quantified uncertainty, thereby broadening the practical applicability of neural MI estimators in scientific research.

Abstract

Mutual information (MI) is a fundamental measure of statistical dependence between two variables, yet accurate estimation from finite data remains notoriously difficult. No estimator is universally reliable, and common approaches fail in the high-dimensional, undersampled regimes typical of modern experiments. Recent machine learning-based estimators show promise, but their accuracy depends sensitively on dataset size, structure, and hyperparameters, with no accepted tests to detect failures. We close these gaps through a systematic evaluation of classical and neural MI estimators across standard benchmarks and new synthetic datasets tailored to challenging high-dimensional, undersampled regimes. We contribute: (i) a practical protocol for reliable MI estimation with explicit checks for statistical consistency; (ii) confidence intervals (error bars around estimates) that existing neural MI estimators do not provide; and (iii) a new class of probabilistic critics designed for high-dimensional, high-information settings. We demonstrate the effectiveness of our protocol with computational experiments, showing that it consistently matches or surpasses existing methods while uniquely quantifying its own reliability. We show that reliable MI estimation is sometimes achievable even in severely undersampled, high-dimensional datasets, provided they admit accurate low-dimensional representations. This broadens the scope of applicability of neural MI estimators and clarifies when such estimators can be trusted.

Paper Structure

This paper contains 26 sections, 36 equations, 11 figures, 3 tables.

Figures (11)

  • Figure 1: MI estimators in the low-dimensional, infinite-data regime. Each panel plots running MI estimates over training iterations for five true MI levels (increasing every 4000 iterations). Each step introduces a fresh batch of 128 samples. We compare the CCA-based estimator (optimal for Gaussian data), InfoNCE, SMILE, and their probabilistic variants (denoted with VSIB). Faint curves show raw estimates; bold curves show smoothed trends (Appx. \ref{implementation_details}). Left: For jointly Gaussian $X, Y$, all estimators initially perform well. InfoNCE plateaus at its well-known intrinsic upper bound, $\log(\text{batch size}) \approx 7$ bits (van den Oord et al., 2018), while SMILE begins to overestimate at high MI, indicating overfitting. $I_{\rm CCA}$ overlaps with ground truth, as expected. Middle: Cubing $Y$ breaks linearity, and $I_{\rm CCA}$ fails. Nonetheless, InfoNCE behavior is almost unchanged, and SMILE remains reasonably effective with sufficient training (Appx. Fig. \ref{si_fig:low_dim_infinite_more_epochs}). Both slightly underestimate at low MI, and for SMILE this is largely offset by its intrinsic positive bias at high MI. Right: Passing $X$ and $Y$ through separate frozen teacher networks (one hidden softplus layer, 1024 units) creates highly nonlinear dependencies. All estimators underestimate MI.
  • Figure 2: MI estimators in the high-dimensional, infinite-data regime. We extend Fig. 1 to $K_X = K_Y = 500$, embedding $K_Z = 10$ latent variables into high-dimensional $X$ and $Y$. Left: A linear transformation (e.g., identity and replication) expands $Z$ to 500-dimensional $X$ and $Y$. $I_{\rm CCA}$ with $k_Z \ge K_Z = 10$ dimensions still accurately recovers the ground truth. Middle: A frozen nonlinear teacher network maps $Z$ to 500-dimensional $X$ and $Y$. Unlike the linear case, $I_{\rm CCA}$ fails due to the nonlinearity of the transformation. Increasing $k_Z > K_Z$ detects spurious correlations, inflating MI estimates and illustrating the limitations of linear methods in nonlinear settings. Right: Neural estimators (InfoNCE, SMILE, and their VSIB variants) are applied directly to the full 500-dimensional data. All are accurate across the full range of true MI values, performing even better than in Fig. 1 due to improved invertibility of the nonlinear transformation in high dimensions.
  • Figure 3: The stopping heuristic. We evaluate neural MI estimators in the finite-data regime using the teacher model from Fig. 2, where 10 latent variables carrying 4 bits of MI are embedded in 500-dimensional $X$ and $Y$. We compare two sampling regimes for InfoNCE (left) and SMILE (right): 256 samples (undersampled) and a larger dataset of $2^{14} = 16{,}384$ samples (better sampled). In all cases, the test-set MI initially rises before declining due to overfitting (negative values are not shown). The stopping heuristic selects the epoch with the peak test MI but reports the corresponding training MI. Here the batch size is 128, so that InfoNCE does not saturate.
  • Figure 4: MI vs. sample size for low and high information. We compare InfoNCE, SMILE, and their VSIB variants, all with max-test stopping, across sample sizes. Data are generated by the frozen teacher model ($10$ latent, $500$ data dimensions). All estimators use separable critics with $k_Z=32$. Means $\pm$ s.d. over 10 trials are shown. Left: For small MI (4 bits), all estimators recover the ground truth for $10^2\lesssim N< K=500$. Right: For high MI (8 bits), contrastive estimators (InfoNCE and InfoNCE-VSIB) saturate near $\log(\text{batch size})=7$ bits. SMILE overestimates dramatically as $N$ grows. SMILE-VSIB tracks the ground truth accurately for all $N\gtrsim 10^2$.
  • Figure 5: Effect of latent and critic dimensionality on InfoNCE. Curves show mean $\pm$ s.d. over ten runs. Panels show $K=500$-dimensional data generated by teacher networks with latent dimensionality $K_Z = 10, 100,$ and $500$ (left to right). The true MI is 4 bits throughout. A sufficiently expressive critic ($k_Z \ge K_Z$) is required to recover all the information, yet the estimate approaches 4 bits only in the low-dimensional latent case ($K_Z \ll K$) when the sample size satisfies $N \gg K_Z$. For larger latent spaces, the estimate remains far below the target even with large $N$. Vertical lines mark the $N$ needed to detect nonzero MI using Gaussian random matrix models in the latent space ($N^*_Z$) and in the full space ($N^*$), cf. Appx. \ref{sec:SpikeDetection} ($N^*_Z\approx 1$ is not shown in the left panel, and $N^*_Z=N^*$ in the right panel). Since a nonzero estimate emerges at $N>N^*_Z$, but at $N<N^*$ if $K_Z\ll K$, sampling of the latent space (not the full data space) governs the estimation even in the non-Gaussian setting.
  • ...and 6 more figures
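
Two mechanisms named in the captions above lend themselves to a short illustration: the $\log(\text{batch size})$ ceiling of InfoNCE (Figs. 1 and 4) and the max-test stopping heuristic (Fig. 3). The following is a minimal NumPy sketch, not the authors' implementation. The function names `infonce_bits` and `max_test_stop` are hypothetical, the critic is abstracted away as a precomputed score matrix, and the standard InfoNCE form (mean log-softmax of the positive pair plus the log of the batch size) is assumed.

```python
import numpy as np


def infonce_bits(scores: np.ndarray) -> float:
    """InfoNCE lower bound, in bits, from a square critic score matrix.

    Assumption: scores[i, j] is the critic output for the pair (x_i, y_j);
    the diagonal holds jointly sampled pairs, off-diagonals act as negatives.
    """
    n = scores.shape[0]
    # Row-wise log-softmax, stabilized by subtracting each row's maximum.
    shifted = scores - scores.max(axis=1, keepdims=True)
    log_softmax = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Mean log-probability of the positive pair plus log(n), converted to bits.
    # Each diagonal log-softmax entry is <= 0, so the result is <= log2(n).
    return float((np.mean(np.diag(log_softmax)) + np.log(n)) / np.log(2))


def max_test_stop(train_mi, test_mi):
    """Pick the epoch where the test MI peaks; report the training MI there."""
    train_mi = np.asarray(train_mi, dtype=float)
    test_mi = np.asarray(test_mi, dtype=float)
    best_epoch = int(np.argmax(test_mi))
    return best_epoch, float(train_mi[best_epoch])


if __name__ == "__main__":
    # A critic that separates positives perfectly still cannot exceed
    # log2(128) = 7 bits at batch size 128.
    print(infonce_bits(1e3 * np.eye(128)))
    # Illustrative (made-up) learning curves: test MI peaks at epoch 2.
    print(max_test_stop([1.0, 2.0, 3.0, 4.0, 5.0], [0.5, 1.5, 2.0, 1.0, -1.0]))
```

With a batch of 128, even an ideal critic yields about 7 bits, which is consistent with the saturation described for InfoNCE at high MI. The stopping function mirrors the caption's description: the test curve is used only to choose the epoch, while the reported value comes from the training curve.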