A Neural Difference-of-Entropies Estimator for Mutual Information

Haoran Ni, Martin Lotz

TL;DR

This work tackles the challenging problem of estimating mutual information in high dimensions without strong modelling assumptions. It introduces a difference-of-entropies (DoE) estimator implemented with block autoregressive normalizing flows that jointly model $H(X)$ and $H(X|Y)$, enabling unbiased, consistent MI estimation via $I(X;Y)=H(X)-H(X|Y)$. Theoretical results establish the existence and properties of block-triangular normalizing flows to represent joint densities and conditional densities, while empirical evaluations show robust performance across Gaussian, nonlinear, and heavy-tailed distributions, often surpassing state-of-the-art discriminative and generative baselines. The approach offers a scalable, principled MI estimator with potential impact on ML tasks requiring dependence measures and information-theoretic objectives, though it requires careful architectural choices for stability and may be extended to discrete settings and downstream applications.

Abstract

Estimating Mutual Information (MI), a key measure of dependence of random quantities without specific modelling assumptions, is a challenging problem in high dimensions. We propose a novel mutual information estimator based on parametrizing conditional densities using normalizing flows, a deep generative model that has gained popularity in recent years. This estimator leverages a block autoregressive structure to achieve improved bias-variance trade-offs on standard benchmark tasks.
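
The difference-of-entropies idea can be illustrated on a toy problem. The sketch below is not the paper's estimator: it replaces the normalizing flows with scalar Gaussian density models (an assumption made purely for illustration), fits a marginal model $q(x)$ and a conditional model $q(x|y)$ by maximum likelihood, and subtracts the two average negative log-likelihoods. The function names `gaussian_cross_entropy` and `doe_estimate` are hypothetical.

```python
import math
import random


def gaussian_cross_entropy(residuals):
    """Average of -log q(r) under a zero-mean Gaussian q whose variance
    is the maximum-likelihood fit to the residuals."""
    n = len(residuals)
    var = sum(r * r for r in residuals) / n
    return sum(
        0.5 * (math.log(2 * math.pi * var) + r * r / var) for r in residuals
    ) / n


def doe_estimate(xs, ys):
    """Difference-of-entropies estimate of I(X;Y) = H(X) - H(X|Y) for scalar data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Marginal model q(x): Gaussian centred at the sample mean.
    h_x = gaussian_cross_entropy([x - mean_x for x in xs])
    # Conditional model q(x|y): Gaussian around the least-squares line in y.
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    var_y = sum((y - mean_y) ** 2 for y in ys) / n
    slope = cov_xy / var_y
    h_x_given_y = gaussian_cross_entropy(
        [(x - mean_x) - slope * (y - mean_y) for x, y in zip(xs, ys)]
    )
    return h_x - h_x_given_y


# Toy data: a correlated Gaussian pair with known mutual information.
rng = random.Random(0)
rho, n = 0.8, 20000
ys = [rng.gauss(0.0, 1.0) for _ in range(n)]
xs = [rho * y + math.sqrt(1 - rho**2) * rng.gauss(0.0, 1.0) for y in ys]

true_mi = -0.5 * math.log(1 - rho**2)  # closed form for a bivariate Gaussian
estimate = doe_estimate(xs, ys)
print(f"true MI = {true_mi:.3f}, DoE estimate = {estimate:.3f}")
```

In this toy setting the Gaussian family contains the true densities, so the estimate should land close to the closed-form value. For non-Gaussian data the two density models would need to be far more expressive, which is the role the abstract assigns to normalizing flows.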

Paper Structure

This paper contains 20 sections, 5 theorems, 44 equations, 17 figures, 1 algorithm.

Key Result

Lemma 1.1

Let $(X,Y)$ be a pair of random variables with joint density $p$. Then $H(X|Y)=\inf_{q}\mathbb{E}_{p}\left[-\log q(X|Y)\right]$, where the infimum is over all conditional densities, i.e., non-negative functions $q(x|y)$ such that $\int_x q(x|y) \ \mathrm{d}x=1$ for all $y$.
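
Assuming the lemma is the standard variational characterisation $H(X|Y)=\inf_{q}\mathbb{E}_{p}[-\log q(X|Y)]$, the following sketch shows why the infimum equals the conditional entropy. It is a routine cross-entropy decomposition, not a reproduction of the paper's proof.

```latex
\begin{align*}
\mathbb{E}_{p}\left[-\log q(X|Y)\right]
  &= \mathbb{E}_{p}\left[-\log p(X|Y)\right]
   + \mathbb{E}_{p}\left[\log \frac{p(X|Y)}{q(X|Y)}\right] \\
  &= H(X|Y)
   + \mathbb{E}_{Y}\left[ D_{\mathrm{KL}}\big(p(\cdot|Y)\,\|\,q(\cdot|Y)\big) \right] \\
  &\geq H(X|Y).
\end{align*}
```

The Kullback-Leibler term is non-negative and vanishes when $q(x|y)=p(x|y)$ almost everywhere, so the bound is attained. Minimising the average negative log-likelihood of a parametrised $q$ therefore yields an upper bound on $H(X|Y)$ that tightens as the model improves.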

Figures (17)

  • Figure 1: A Block Autoregressive Flow $f(y,x)$. Solid lines represent positive weights.
  • Figure 2: MI estimation between multivariate Gaussian variables (Top) and between multivariate Gaussian variables with a cubic transformation (Bottom). The training data size is 128K. The estimation error $(I(x,y)-\hat{I}(x,y))$ is reported. Closer to zero is better.
  • Figure 3: MI estimation between multivariate Gaussian variables (Top) and between multivariate Gaussian variables with a cubic transformation (Bottom). The training data size is 64K. The estimation error $(I(x,y)-\hat{I}(x,y))$ is reported. Closer to zero is better.
  • Figure 4: MI estimation between multivariate Gaussian variables (Top) and between multivariate Gaussian variables with a cubic transformation (Bottom). The training data size is 32K. The estimation error $(I(x,y)-\hat{I}(x,y))$ is reported. Closer to zero is better.
  • Figure 5: MI estimation between multivariate Sparse Gaussian variables (Top) and between multivariate Sparse Gaussian variables with a cubic transformation (Bottom). The training data size is 128K. The estimation error $(I(x,y)-\hat{I}(x,y))$ is reported. Closer to zero is better.
  • ...and 12 more figures
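
The Gaussian and cubic benchmarks named in the captions can be sketched as follows. The construction is an assumption, since the captions do not spell it out: each coordinate pair $(x_i, y_i)$ is taken to be a standard bivariate Gaussian with correlation `rho`, and the cubic variant cubes `y` elementwise. Under that construction the ground truth is $-\frac{d}{2}\log(1-\rho^2)$, and it is unchanged by the cubic map because the map is invertible. All function names are hypothetical.

```python
import math
import random


def sample_correlated_gaussians(n, dim, rho, cubic=False, seed=0):
    """Draw n pairs (x, y) of dim-dimensional vectors whose coordinates
    are pairwise correlated with coefficient rho (assumed construction)."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        y = [rho * xi + math.sqrt(1 - rho**2) * rng.gauss(0.0, 1.0) for xi in x]
        if cubic:
            # Invertible elementwise map: it leaves the mutual information unchanged.
            y = [yi**3 for yi in y]
        xs.append(x)
        ys.append(y)
    return xs, ys


def true_mi(dim, rho):
    """Closed-form mutual information of the construction above, in nats."""
    return -0.5 * dim * math.log(1 - rho**2)


def rho_for_mi(dim, mi):
    """Correlation that produces a target mutual information."""
    return math.sqrt(1 - math.exp(-2 * mi / dim))


def estimation_error(true_value, estimate):
    """Signed error in the captions' convention: I(x,y) minus the estimate."""
    return true_value - estimate


rho = rho_for_mi(dim=20, mi=4.0)
xs, ys = sample_correlated_gaussians(n=8, dim=20, rho=rho, cubic=True)
print(f"rho = {rho:.4f}, ground-truth MI = {true_mi(20, rho):.4f}")
```

With this sign convention a positive error means the estimator undershoots the true value.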

Theorems & Definitions (8)

  • Lemma 1.1
  • proof
  • Corollary 1.2
  • Example A.1
  • Theorem B.1
  • proof
  • Corollary B.2
  • Theorem B.3