Mutual Information Estimation via $f$-Divergence and Data Derangements

Nunzio A. Letizia; Nicola Novello; Andrea M. Tonello

TL;DR

A novel class of discriminative mutual information estimators based on the variational representation of the $f$-divergence is proposed, which offers higher accuracy and lower complexity than state-of-the-art neural estimators.

Abstract

Estimating mutual information accurately is pivotal across diverse applications, from machine learning to communications and biology, enabling us to gain insights into the inner mechanisms of complex systems. Yet, dealing with high-dimensional data presents a formidable challenge, due to its size and the presence of intricate relationships. Recently proposed neural methods employing variational lower bounds on the mutual information have gained prominence. However, these approaches suffer from either high bias or high variance, as the sample size and the structure of the loss function directly influence the training process. In this paper, we propose a novel class of discriminative mutual information estimators based on the variational representation of the $f$-divergence. We investigate the impact of the permutation function used to obtain the marginal training samples and present a novel architectural solution based on derangements. The proposed estimator is flexible since it exhibits an excellent bias/variance trade-off. The comparison with state-of-the-art neural estimators, through extensive experimentation within established reference scenarios, shows that our approach offers higher accuracy and lower complexity.

Paper Structure (37 sections, 13 theorems, 86 equations, 20 figures, 6 tables)


Introduction
Related Work
$f$-Divergence Mutual Information Estimation
Variance Analysis
Derangement Strategy
Experimental Results
Architectures
Complex Gaussian and non-Gaussian distributions
Self-Consistency Tests
Conclusions
Appendix: DIME Estimators
KL divergence
GAN divergence
Hellinger distance
Appendix: Related Work Mutual Information Estimators
...and 22 more sections

Key Result

Theorem 3.1

Let $(X,Y) \sim p_{XY}(\mathbf{x},\mathbf{y})$ be a pair of multivariate random variables. Let $\sigma(\cdot)$ be a permutation function such that $p_{\sigma(Y)}(\sigma(\mathbf{y})|\mathbf{x}) = p_{Y}(\mathbf{y})$, and let $T:\mathrm{dom}(X)\times \mathrm{dom}(Y) \to \mathbb{R}$. Let $f^*$ be the Fenchel conjugate of $f$ [...] then [...] and [...]
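The condition on $\sigma(\cdot)$ in the theorem requires that the shuffled $\mathbf{y}$ samples behave like draws from the marginal $p_Y$, independent of $\mathbf{x}$. A plain random permutation of a batch can leave some index in place, so that $\mathbf{x}_i$ is still paired with its own $\mathbf{y}_i$; a derangement is a permutation with no such fixed points. The following is a minimal sketch of that idea, not the authors' implementation: the function name `random_derangement`, the rejection-sampling approach, and the toy data are assumptions made for illustration, with $N=128$ and $d=20$ borrowed from the Figure 1 caption.

```python
import numpy as np

def random_derangement(n, rng):
    """Sample a permutation of range(n) with no fixed points.

    Hypothetical helper: uses rejection sampling, which accepts a
    random permutation with probability close to 1/e for large n.
    """
    if n < 2:
        raise ValueError("a derangement needs at least two elements")
    identity = np.arange(n)
    while True:
        perm = rng.permutation(n)
        # Accept only if no index is mapped to itself.
        if not np.any(perm == identity):
            return perm

rng = np.random.default_rng(0)
N, d = 128, 20  # batch size and data dimension, as in the Figure 1 caption

# Toy paired batch (illustrative assumption): y is a noisy copy of x.
x = rng.normal(size=(N, d))
y = x + 0.1 * rng.normal(size=(N, d))

sigma = random_derangement(N, rng)
y_shuffled = y[sigma]  # row i now holds a y that was NOT drawn with x[i]

# Count how many pairs (x[i], y_shuffled[i]) are still original joint pairs.
fixed_points = int(np.sum(sigma == np.arange(N)))
print("fixed points after derangement:", fixed_points)
```

With a plain `rng.permutation(N)` in place of `random_derangement`, the expected number of fixed points is one per batch, which means one joint pair would typically be mislabeled as a marginal sample.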

Figures (20)

Figure 1: MI estimate obtained with derangement and permutation training procedures, for data dimension $d=20$ and batch size $N=128$.
Figure 2: Staircase MI estimation comparison for $d=5$ and $N=64$. The Gaussian case is reported in the top row, while the cubic case is shown in the bottom row.
Figure 3: Staircase MI estimation comparison for $d=5$ and $N=64$. Top: Half-cube scenario. Middle: Asinh scenario. Bottom: Swiss roll scenario.
Figure 4: Staircase MI estimation comparison for $d=1$ and $N=64$. Top row: Uniform scenario. Bottom row: Student scenario.
Figure 5: Time required to complete the 5-step staircase MI estimation. From the left, the first two plots show the time as the batch size varies; the last one shows the time as the dimension of the probability distribution varies.
...and 15 more figures

Theorems & Definitions (21)

Theorem 3.1
Lemma 3.2
Lemma 5.1
Theorem 5.2
Corollary 5.3: Permutation bound
Theorem 3.1
proof
Lemma 3.2
proof
Lemma 4.1
...and 11 more
