MMG: Mutual Information Estimation via the MMSE Gap in Diffusion

Longxuan Yu, Xing Shi, Xianghao Kong, Tong Jia, Greg Ver Steeg

TL;DR

MMG recasts mutual information estimation as integrating the MMSE gap between conditional and unconditional denoisers in a diffusion model: I(x;y) equals half the area between the two MMSE curves across SNRs. It adds adaptive importance sampling to concentrate samples on the informative SNR ranges and uses an orthogonal principle to stabilize the MI integrand, reporting state-of-the-art results on MI benchmarks and reliable estimates in high-MI regimes. The approach relies on denoising objectives rather than gradient-based score estimation, and it is released as a unified PyTorch library for side-by-side comparisons with other estimators.

Abstract

Mutual information (MI) is one of the most general ways to measure relationships between random variables, but estimating this quantity for complex systems is challenging. Denoising diffusion models have recently set a new bar for density estimation, so it is natural to consider whether these methods could also be used to improve MI estimation. Using the recently introduced information-theoretic formulation of denoising diffusion models, we show that diffusion models can be used in a straightforward way to estimate MI. In particular, the MI corresponds to half the gap in the minimum mean square error (MMSE) between conditional and unconditional diffusion, integrated over all signal-to-noise ratios (SNRs) in the noising process. Our approach not only passes self-consistency tests but also outperforms traditional and score-based diffusion MI estimators. Furthermore, our method leverages adaptive importance sampling to achieve scalable MI estimation, while maintaining strong performance even when the MI is high.
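The MMSE-gap identity can be checked by hand in a case where everything is known in closed form. The sketch below is an illustration, not the paper's code: it assumes a scalar, jointly Gaussian pair with correlation `rho`, a noise channel of the form z = sqrt(snr) * x + eps with standard normal eps (the paper's exact parameterization may differ), and analytic MMSE curves in place of trained denoisers. The function names are hypothetical.

```python
import math

def mmse_gaussian(var, snr):
    """MMSE of estimating a Gaussian x with variance `var`
    from z = sqrt(snr) * x + eps, where eps ~ N(0, 1)."""
    return var / (1.0 + snr * var)

def mi_from_mmse_gap(rho, log_snr_min=-20.0, log_snr_max=20.0, n=20001):
    """Half the integral of the MMSE gap over SNR, computed with the
    trapezoid rule on a uniform log-SNR grid (assumed integration range)."""
    cond_var = 1.0 - rho ** 2  # variance of x given y for unit-variance Gaussians
    step = (log_snr_max - log_snr_min) / (n - 1)
    total = 0.0
    for i in range(n):
        alpha = log_snr_min + i * step
        snr = math.exp(alpha)
        gap = mmse_gaussian(1.0, snr) - mmse_gaussian(cond_var, snr)
        weight = 0.5 if i in (0, n - 1) else 1.0
        total += weight * gap * snr * step  # change of variables: d(snr) = snr * d(alpha)
    return 0.5 * total

rho = 0.9
estimate = mi_from_mmse_gap(rho)
exact = -0.5 * math.log(1.0 - rho ** 2)  # closed-form Gaussian MI in nats
print(f"MMSE-gap estimate: {estimate:.4f} nats, closed form: {exact:.4f} nats")
```

In this toy setting the two numbers agree up to integration error. In practice the analytic curves would be replaced by the mean squared errors of learned conditional and unconditional denoisers.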

Paper Structure

This paper contains 23 sections, 12 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: The mutual information is exactly half the area between MMSE curves for conditional and unconditional denoising. We use denoising diffusion models to approximate the MMSE curves, then numerically integrate to get an estimate of the mutual information.
  • Figure 2: Schematic of the MMG training process. Noise at level $\gamma$ is added to the data ${\bm{x}}$, and a denoiser is trained to recover it; the condition $y$ is supplied with 50% probability and dropped otherwise. The MI loss defined in Eq. \ref{eq:classifier_train} is then computed and backpropagated.
  • Figure 3: High MI benchmark: original (column (b)) and transformed variants (columns (a) and (c)).
  • Figure 4: Consistency tests on the MNIST dataset: (a) evaluation of $\frac{I(A; B_r)}{I(A; B)}$; (b) evaluation of $\frac{I(A; [B_{r+k}, B_r])}{I(A; B_{r+k})}$ for $k>0$; (c) evaluation of $\frac{I([A^1, A^2]; [B^1_r, B^2_r])}{I(A^1; B^1_r)}$.
  • Figure 5: Integrand analysis on a Spiral-transformed task (GT MI = 9.90). Comparison between (a) the volatile direct MMSE subtraction method and (b) the stable orthogonal principle. In both plots, the adaptive sampler (red) outperforms the baseline (blue) by focusing on the most informative region.
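
The contrast in Figure 5 between direct subtraction and the orthogonal form can be illustrated in a toy setting. The sketch below assumes that the "orthogonal principle" refers to the standard orthogonality property of MMSE estimators, under which the expected gap in squared errors equals the expected squared difference between the conditional and unconditional denoiser outputs. It uses a scalar jointly Gaussian pair with closed-form optimal denoisers and the channel z = sqrt(snr) * x + eps. The function `gap_samples` and all parameter values are hypothetical and not taken from the paper.

```python
import math
import random
import statistics

def gap_samples(rho, snr, n=20000, seed=0):
    """Per-sample MMSE-gap integrands at one SNR, computed two ways."""
    rng = random.Random(seed)
    v = 1.0 - rho ** 2  # variance of x given y
    direct, orthogonal = [], []
    for _ in range(n):
        y = rng.gauss(0.0, 1.0)
        x = rho * y + math.sqrt(v) * rng.gauss(0.0, 1.0)
        z = math.sqrt(snr) * x + rng.gauss(0.0, 1.0)
        # Optimal denoisers (posterior means) for this Gaussian toy model.
        x_uncond = math.sqrt(snr) * z / (1.0 + snr)
        x_cond = (rho * y / v + math.sqrt(snr) * z) / (1.0 / v + snr)
        # Direct form: difference of the two squared errors.
        direct.append((x - x_uncond) ** 2 - (x - x_cond) ** 2)
        # Orthogonal form: squared difference of the two denoiser outputs.
        orthogonal.append((x_cond - x_uncond) ** 2)
    return direct, orthogonal

rho, snr = 0.9, 1.0
direct, orthogonal = gap_samples(rho, snr)
true_gap = 1.0 / (1.0 + snr) - (1.0 - rho ** 2) / (1.0 + snr * (1.0 - rho ** 2))
print("closed-form gap:", true_gap)
print("direct     mean, variance:", statistics.fmean(direct), statistics.variance(direct))
print("orthogonal mean, variance:", statistics.fmean(orthogonal), statistics.variance(orthogonal))
```

Both estimators target the same gap, but in this Gaussian example the orthogonal form is non-negative by construction and has lower per-sample variance, which is consistent with the stability contrast the caption describes. With learned, imperfect denoisers the two forms need not have exactly the same mean.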