Approximating mutual information of high-dimensional variables using learned representations

Gokul Gowri; Xiao-Kang Lun; Allon M. Klein; Peng Yin

TL;DR

Mutual information $I(X;Y)$ is a powerful dependence measure, but it is difficult to estimate in high dimensions because of sample complexity. The authors introduce latent mutual information (LMI) approximation, which learns low-dimensional encodings $Z_x=f(X)$ and $Z_y=g(Y)$ via cross-predictive autoencoders and estimates $I(Z_x;Z_y)$ with a nonparametric estimator, so that by construction $I(Z_x;Z_y)\le I(X;Y)$. They demonstrate the stability and accuracy of LMI on synthetic data with high ambient dimensionality but low-dimensional dependence structure, and on resampled real-world datasets, where LMI outperforms standard MI estimators in high dimensions. The method is then applied to two problems in biology: quantifying interaction information in ProtTrans5 embeddings of interacting proteins, and identifying cell fate information in LT-seq scRNA-seq data, revealing nontrivial MI and non-Markovian dynamics. The work provides open-source code and a framework for benchmarking high-dimensional MI estimators, states explicit limitations when intrinsic dimensionality is large, and calls for careful interpretation of MI estimates in practical applications.
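The inequality in the TL;DR can be recovered from the data processing inequality, applied once per encoder. The following is a short sketch, assuming $f$ and $g$ are deterministic maps (as the notation $Z_x=f(X)$, $Z_y=g(Y)$ suggests), so that $Z_x$ depends on $Y$ only through $X$, and $Z_y$ depends on $X$ only through $Y$:

$$
\begin{aligned}
I(Z_x;Z_y) &\le I(X;Z_y) && \text{since } Z_x = f(X) \text{ is a function of } X \\
           &\le I(X;Y)   && \text{since } Z_y = g(Y) \text{ is a function of } Y
\end{aligned}
$$

The bound concerns the true latent MI. An estimate of $I(Z_x;Z_y)$ from finite samples carries its own estimation error and is not itself guaranteed to stay below $I(X;Y)$.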

Abstract

Mutual information (MI) is a general measure of statistical dependence with widespread application across the sciences. However, estimating MI between multi-dimensional variables is challenging because the number of samples necessary to converge to an accurate estimate scales unfavorably with dimensionality. In practice, existing techniques can reliably estimate MI in up to tens of dimensions, but fail in higher dimensions, where sufficient sample sizes are infeasible. Here, we explore the idea that underlying low-dimensional structure in high-dimensional data can be exploited to faithfully approximate MI in high-dimensional settings with realistic sample sizes. We develop a method that we call latent MI (LMI) approximation, which applies a nonparametric MI estimator to low-dimensional representations learned by a simple, theoretically-motivated model architecture. Using several benchmarks, we show that unlike existing techniques, LMI can approximate MI well for variables with $> 10^3$ dimensions if their dependence structure has low intrinsic dimensionality. Finally, we showcase LMI on two open problems in biology. First, we approximate MI between protein language model (pLM) representations of interacting proteins, and find that pLMs encode non-trivial information about protein-protein interactions. Second, we quantify cell fate information contained in single-cell RNA-seq (scRNA-seq) measurements of hematopoietic stem cells, and find a sharp transition during neutrophil differentiation when fate information captured by scRNA-seq increases dramatically.
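The second stage of the approach, applying a nonparametric estimator to low-dimensional representations, can be illustrated with a minimal KSG-style estimator (Kraskov et al., algorithm 1). This is a sketch under stated assumptions, not the authors' implementation: the function names `ksg_mi`, `digamma_int`, and `chebyshev` are hypothetical, the neighbor search is brute force in pure Python, the result is in nats rather than bits, and the inputs `zx`, `zy` are stand-in correlated Gaussians rather than outputs of learned encoders.

```python
import math
import random

EULER_GAMMA = 0.5772156649015329


def digamma_int(n):
    """Digamma at a positive integer n: psi(n) = -gamma + sum_{j=1}^{n-1} 1/j."""
    return -EULER_GAMMA + sum(1.0 / j for j in range(1, n))


def chebyshev(a, b):
    """Max-norm distance between two equal-length points."""
    return max(abs(u - v) for u, v in zip(a, b))


def ksg_mi(zx, zy, k=3):
    """KSG (algorithm 1) estimate of I(Zx; Zy) in nats. Brute force, O(N^2)."""
    n = len(zx)
    total = 0.0
    for i in range(n):
        # Marginal distances from sample i to every sample (including itself).
        dx = [chebyshev(zx[i], zx[j]) for j in range(n)]
        dy = [chebyshev(zy[i], zy[j]) for j in range(n)]
        # Joint-space max-norm distance to the k-th nearest other sample.
        joint = sorted(max(dx[j], dy[j]) for j in range(n) if j != i)
        eps = joint[k - 1]
        # Count other samples strictly closer than eps in each marginal space.
        nx = sum(1 for j in range(n) if j != i and dx[j] < eps)
        ny = sum(1 for j in range(n) if j != i and dy[j] < eps)
        total += digamma_int(nx + 1) + digamma_int(ny + 1)
    # Average of pointwise contributions psi(k) + psi(N) - psi(nx+1) - psi(ny+1).
    return digamma_int(k) + digamma_int(n) - total / n


# Stand-in "latent" data (an assumption for illustration): 1-D correlated Gaussians,
# for which the MI is known analytically as -0.5 * ln(1 - rho^2).
random.seed(0)
rho, n_samples = 0.8, 400
zx, zy = [], []
for _ in range(n_samples):
    a, b = random.gauss(0, 1), random.gauss(0, 1)
    zx.append([a])
    zy.append([rho * a + math.sqrt(1 - rho**2) * b])

estimate = ksg_mi(zx, zy)
truth = -0.5 * math.log(1 - rho**2)
print(f"KSG estimate: {estimate:.3f} nats, analytic value: {truth:.3f} nats")
```

In an LMI-style workflow, `zx` and `zy` would be replaced by the encoder outputs for paired samples. A tree-based neighbor search would be preferable to the quadratic loop at the sample sizes quoted in the figures.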

Paper Structure (33 sections, 4 theorems, 27 equations, 12 figures, 1 table, 5 algorithms)

Introduction
Approach
Empirical evaluation
Evaluating mutual information estimators on synthetic data
Empirically quantifying convergence rates of MI estimators on synthetic data
Evaluating mutual information estimators on resampled real-world data
Applications
Quantifying interaction information in protein language model embeddings
Identifying cell fate information in hematopoietic stem cells
Discussion
Limitations
Broader impacts
Code reproducibility
Author contributions
Appendix / supplemental material
...and 18 more sections

Key Result

Theorem 1

Let $X = [X_1,\ldots,X_d]$ and $Z = [Z_1,\ldots,Z_k]$ be absolutely continuous random vectors in $\mathbb{R}^d$ and $\mathbb{R}^k$ respectively with finite differential entropy. Let $f_\theta : \mathbb{R}^k \to \mathbb{R}^d$ be a function (a neural network parameterized by $\theta$) to estimate $\hat{X}$ [...] where $\alpha$ is a positive constant and $\text{MSE}(\hat{X}, X) = \frac{1}{d} \sum_i \mathbb{E}[(\hat{X}_i - X_i)^2]$ [...]

Figures (12)

Figure 1: Workflow of latent MI approximation. a) Embed high-dimensional data in low-dimensional space such that mutually informative structure is preserved. b) The KSG estimator (Kraskov et al., 2004) is used to estimate MI by averaging over pointwise MI (pMI) contributions.
Figure 2: MI estimator performance scaling with increasing dimensionality. a) - d) Absolute accuracy measured by mean-squared error over 10 estimates per setting, with ground truth MI between 0 and 2 bits, and $2\cdot10^3$ samples per estimate. e) Estimator with highest absolute accuracy in each setting. Ties broken randomly. f) - i) Relative accuracy measured by Kendall $\tau$ rank correlation of estimates with ground truth. j) Estimator with highest relative accuracy in each setting. Ties broken randomly.
Figure 3: Number of samples required to achieve $|I(X, Y) - \hat{I} (X, Y)| < \epsilon$. a) Data with low-rank dependence structure, with $\epsilon = 0.1$. b) Moderate-rank dependence structure, with $\epsilon=0.2$. c) Full-rank dependence structure, with $\epsilon=0.4$. The "+" marker indicates that $N>10^4$ samples are required for accurate estimates for all larger $d$.
Figure 4: Performance of MI estimators on resampled real datasets. a) Estimates on resampled pairs of MNIST digits, with $5\cdot10^3$ samples and $784$ dimensions. b) Estimates on resampled pairs of ProtTrans5 sequence embeddings, with $4.4 \cdot 10^3$ samples and $1024$ dimensions. c) Statistics of estimator accuracy and runtime (in seconds), for each dataset type.
Figure 5: Quantifying dependence between participants of protein interactions. a) - b) MI estimates between interaction partners, compared to randomly permuted data. c) - d) ROC curves of density ratio classifier distinguishing annotated interacting pairs from unannotated "negative" samples, for all pairs of $170$ held-out proteins. Averages over 20 random hold-out splits.
...and 7 more figures

Theorems & Definitions (8)

Theorem 1
proof
Theorem 2
proof
Theorem 3
proof
Theorem 4
proof
