On Variational Bounds of Mutual Information

Ben Poole, Sherjil Ozair, Aaron van den Oord, Alexander A. Alemi, George Tucker

TL;DR

This work unifies variational bounds on mutual information (MI) and analyzes why existing lower bounds degrade as MI grows. It introduces a continuum of bounds that trade off bias against variance, extends to multi-sample and nonlinear interpolations, and leverages known conditional structures and density-ratio estimators to construct tractable estimators with provable bounds. The authors provide empirical bias-variance characterizations on synthetic high-dimensional problems and demonstrate decoder-free, MI-based representation learning on dSprites, highlighting practical gains in disentanglement under information constraints. The results offer a toolkit of tunable MI bounds that balance tractability and tightness, informing both MI estimation and representation learning in high-dimensional settings.

Abstract

Estimating and optimizing Mutual Information (MI) is core to many problems in machine learning; however, bounding MI in high dimensions is challenging. To establish tractable and scalable objectives, recent work has turned to variational bounds parameterized by neural networks, but the relationships and tradeoffs between these bounds remain unclear. In this work, we unify these recent developments in a single framework. We find that the existing variational lower bounds degrade when the MI is large, exhibiting either high bias or high variance. To address this problem, we introduce a continuum of lower bounds that encompasses previous bounds and flexibly trades off bias and variance. On high-dimensional, controlled problems, we empirically characterize the bias and variance of the bounds and their gradients and demonstrate the effectiveness of our new bounds for estimation and representation learning.

Paper Structure

This paper contains 18 sections, 24 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Schematic of variational bounds of mutual information presented in this paper. Nodes are colored based on their tractability for estimation and optimization: green bounds can be used for both, yellow for optimization but not estimation, and red for neither. Children are derived from their parents by introducing new approximations or assumptions.
  • Figure 2: Performance of bounds at estimating mutual information. Top: The dataset $p(x,y; \rho)$ is a correlated Gaussian with the correlation $\rho$ stepping over time. Bottom: the dataset is created by drawing $x,y\sim p(x, y; \rho)$ and then transforming $y$ to get $(Wy)^3$ where $W_{ij} \sim \mathcal{N}(0, 1)$ and the cubing is elementwise. Critics are trained to maximize each lower bound on MI, and the objective (light) and smoothed objective (dark) are plotted for each technique and critic type. The single-sample bounds ($I_\text{NWJ}$ and $I_\text{JS}$) have higher variance than $I_{\text{NCE}}$ and $I_\alpha$, but achieve competitive estimates on both datasets. While $I_{\text{NCE}}$ is a poor estimator of MI with the small training batch size of 64, the interpolated bounds are able to provide less biased estimates than $I_{\text{NCE}}$ with less variance than $I_\text{NWJ}$. For the more challenging nonlinear relationship in the bottom set of panels, the best estimates of MI are with $\alpha=0.01$. Using a joint critic (orange) outperforms a separable critic (blue) for $I_\text{NWJ}$ and $I_\text{JS}$, while the multi-sample bounds are more robust to the choice of critic architecture.
  • Figure 3: Bias and variance of MI estimates with the optimal critic. While $I_\text{NWJ}$ is unbiased when given the optimal critic, $I_{\text{NCE}}$ can exhibit large bias that grows linearly with MI. The $I_\alpha$ bounds trade off bias and variance to recover more accurate bounds in terms of MSE in certain regimes.
  • Figure 4: Gradient accuracy of MI estimators. Left: MSE between the true encoder gradients and approximate gradients as a function of mutual information and batch size (colors the same as in Fig. 3). Right: For each mutual information and batch size, we evaluated the $I_\alpha$ bound with different values of $\alpha$ and found the $\alpha$ that had the smallest gradient MSE. For small MI and small batch size, $I_{\text{NCE}}$-like objectives are preferred, while for large MI and large batch size, $I_\text{NWJ}$-like objectives are preferred.
  • Figure 5: Feature selectivity on dSprites. The representation learned with our regularized InfoMax objective exhibits disentangled features for position and scale, but not rotation. Each row corresponds to a different active latent dimension. The first column depicts the position tuning of the latent variable, where the x and y axes correspond to x/y position, and the color corresponds to the average activation of the latent variable in response to an input at that position (red is high, blue is low). The scale and rotation columns show the average value of the latent on the y axis, and the value of the ground truth factor (scale or rotation) on the x axis.
  • ...and 2 more figures
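
The synthetic setup in the captions of Figures 2 and 3 can be sketched in a few lines of NumPy. This is a minimal illustration and not the authors' code: the function names, the dimension (20), the correlation (0.9), and the random seed are assumptions chosen for the example; only the batch size of 64 comes from the Figure 2 caption. It samples the correlated Gaussian, optionally applies the $(Wy)^3$ transform, and evaluates $I_{\text{NCE}}$ and $I_\text{NWJ}$ with the analytically optimal critic (the log density ratio, known in closed form for Gaussians). Because each term of $I_{\text{NCE}}$ is at most $\log K$ for batch size $K$, the estimate cannot exceed $\log K$, which matches the bias described for Figure 3 once the true MI is larger than that.

```python
import numpy as np

def sample_correlated_gaussian(rho, dim, batch_size, rng, W=None):
    """Draw (x, y) with per-dimension correlation rho.

    If W is given, return the elementwise cube of W y (the nonlinear variant).
    """
    x = rng.standard_normal((batch_size, dim))
    eps = rng.standard_normal((batch_size, dim))
    y = rho * x + np.sqrt(1.0 - rho ** 2) * eps
    if W is not None:
        y = (y @ W.T) ** 3
    return x, y

def true_mi(rho, dim):
    """Closed-form MI (in nats) of the correlated Gaussian."""
    return -0.5 * dim * np.log(1.0 - rho ** 2)

def log_density_ratio(x, y, rho):
    """scores[i, j] = log p(y_j | x_i) - log p(y_j) for the Gaussian case."""
    var = 1.0 - rho ** 2
    diff = y[None, :, :] - rho * x[:, None, :]            # shape (K, K, dim)
    log_cond = -0.5 * np.log(var) - diff ** 2 / (2.0 * var)
    log_marg = -0.5 * y[None, :, :] ** 2                  # shared constants cancel
    return (log_cond - log_marg).sum(axis=-1)             # shape (K, K)

def infonce(scores):
    """InfoNCE estimate from a (K, K) score matrix; never exceeds log K."""
    K = scores.shape[0]
    m = scores.max(axis=1, keepdims=True)
    lse = m[:, 0] + np.log(np.exp(scores - m).sum(axis=1))  # stable log-sum-exp
    return float(np.mean(np.diag(scores) - lse) + np.log(K))

def nwj(scores):
    """NWJ estimate with critic f = 1 + scores; off-diagonal pairs act as
    samples from the product of marginals."""
    K = scores.shape[0]
    off_diag = ~np.eye(K, dtype=bool)
    return float(1.0 + np.mean(np.diag(scores)) - np.mean(np.exp(scores[off_diag])))

# Illustrative values (assumptions), except the batch size of 64.
rng = np.random.default_rng(0)
K, dim, rho = 64, 20, 0.9
x, y = sample_correlated_gaussian(rho, dim, K, rng)
scores = log_density_ratio(x, y, rho)
nce, nwj_est = infonce(scores), nwj(scores)
print(f"true MI = {true_mi(rho, dim):.2f} nats, log K = {np.log(K):.2f}")
print(f"I_NCE = {nce:.2f}, I_NWJ = {nwj_est:.2f}")

# Nonlinear variant: W is drawn once and reused for every batch.
W = rng.standard_normal((dim, dim))
x_c, y_c = sample_correlated_gaussian(rho, dim, K, rng, W=W)
```

A square Gaussian matrix $W$ is invertible with probability one and elementwise cubing is a bijection, so the transformed dataset has the same MI as the original; `true_mi` therefore serves as the ground truth for both variants. The closed-form critic applies only to the untransformed pair, which is why the sketch does not score `x_c, y_c`.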