Mutual Information Multinomial Estimation

Yanzhi Chen, Zijing Ou, Adrian Weller, Yingzhen Li

TL;DR

This work tackles the challenge of estimating mutual information in high-dimensional, high-MI settings. It introduces Mutual Information Multinomial Estimation (MIME), which uses a multinomial classifier across four distributions, including a marginal-preserving vector Gaussian copula reference, to stabilize MI estimation and reduce overfitting. The authors prove consistency and controlled error bounds, and demonstrate MIME's superior robustness and scalability across synthetic benchmarks, Bayesian experimental design, and self-supervised learning scenarios, outperforming several baselines. The approach offers a practical pathway to reliable MI estimation in complex data regimes and highlights nuanced interactions between MI values and representation quality in SSL.

Abstract

Estimating mutual information (MI) is a fundamental yet challenging task in data science and machine learning. This work proposes a new estimator for mutual information. Our main discovery is that a preliminary estimate of the data distribution can dramatically improve the MI estimate. This preliminary estimate serves as a bridge between the joint distribution and the marginal distributions, and by comparing against this bridge distribution we can more easily recover the true difference between the joint distribution and the marginal distributions. Experiments on diverse tasks, including non-Gaussian synthetic problems with known ground truth and real-world applications, demonstrate the advantages of our method.
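
For reference, the "difference between the joint and the marginal distributions" mentioned above is what the standard definition of mutual information measures. The notation $p(x, y)$, $p(x)$, $p(y)$ is assumed here for illustration and is not taken from this page; the formula is the textbook definition, not the MIME estimator itself:

$$
I(X; Y) \;=\; \mathrm{KL}\big(p(x, y) \,\|\, p(x)\,p(y)\big) \;=\; \mathbb{E}_{p(x, y)}\!\left[\log \frac{p(x, y)}{p(x)\,p(y)}\right]
$$

The quantity is zero exactly when $X$ and $Y$ are independent, and it grows as the joint distribution moves away from the product of its marginals, which is why estimating the log density ratio inside the expectation is the core difficulty.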

Paper Structure (19 sections, 4 theorems, 27 equations, 7 figures, 2 tables)


Key Result

Proposition 1

(Consistency of multinomial MI estimate). Assume that the multinomial classifier $h_c: X \times Y \rightarrow \mathbb{R}$ is uniformly bounded. Then, for every $\varepsilon > 0$ there exists $N(\varepsilon) \in \mathbb{N}$, such that

Figures (7)

  • Figure 1: Comparison of different MI estimators under different $\rho$ in four representative synthetic datasets. The dimensionality $d$ of the data $X, Y \in \mathbb{R}^d$ in the four cases is 64, 32, 32, and 2, respectively.
  • Figure 2: Comparison of different MI estimators under $\rho = 0.7$ and various data dimensionality $d$.
  • Figure 3: Experiment results for BED. (a) Comparing the utility of the optimal design $\mathbf{d}^*$ found by different MI estimators. (b) Visualizing the contour of the underlying function between the utility and the design $\mathbf{d}$ in the death process task. (c) Visualizing the optimal designs found in the PK task.
  • Figure 4: Experimental results for SSL. (a) Classification accuracy on test set as evaluated by a linear classifier. (b) $I(v_1, v_2)$ as estimated by our method. (c) $I(v_1, v_2)$ as computed by InfoNCE.
  • Figure 5: Comparison of different choices of reference distribution $q$. Here $N=10,000$.
  • ...and 2 more figures
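
As a rough illustration of what "known ground truth" can mean in synthetic benchmarks such as those in Figures 1 and 2, the sketch below computes the closed-form MI of a correlated Gaussian pair. It rests on an assumption not stated on this page: that $\rho$ is a per-dimension correlation and that each pair $(X_i, Y_i)$ is bivariate Gaussian, independent across dimensions. The function name `gaussian_mi` is hypothetical, and the non-Gaussian datasets in the paper may be constructed differently.

```python
import math


def gaussian_mi(rho: float, d: int) -> float:
    """Closed-form MI (in nats) between X, Y in R^d.

    Assumes each pair (X_i, Y_i) is bivariate Gaussian with correlation rho
    and that pairs are independent across dimensions i. This is an
    illustrative assumption, not the paper's benchmark specification.
    """
    if not -1.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between -1 and 1")
    if d < 1:
        raise ValueError("d must be a positive integer")
    # Each dimension contributes -0.5 * log(1 - rho^2); MI adds up over
    # independent dimensions.
    return -0.5 * d * math.log(1.0 - rho ** 2)


if __name__ == "__main__":
    # Ground-truth MI grows linearly with d at fixed rho.
    for d in (2, 32, 64):
        print(f"d={d:3d}  rho=0.7  MI={gaussian_mi(0.7, d):.3f} nats")
```

Because the ground-truth value scales linearly with $d$ under this assumption, high-dimensional settings quickly become high-MI settings, which is the regime the TL;DR identifies as challenging.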

Theorems & Definitions (4)

  • Proposition 1
  • Proposition 2
  • Proposition 3
  • Proposition 4