InfoNet: Neural Estimation of Mutual Information without Test-Time Optimization

Zhengyang Hu; Song Kang; Qunsong Zeng; Kaibin Huang; Yanchao Yang

TL;DR

InfoNet addresses the challenge of real-time mutual information estimation between data streams by learning an attention-based neural network that directly outputs the optimal discriminant for the Donsker–Varadhan MI dual, eliminating test-time optimization. It trains on diverse, simulated distributions (notably Gaussian mixtures) with copula normalization to generalize across unseen distributions, enabling fast, differentiable MI estimates from streaming data. The approach discretizes the optimal discriminant into a 2D tensor readout, and is validated across synthetic and real-world tasks, including high-dimensional independence testing via sliced MI and motion-data experiments, showing favorable efficiency-accuracy trade-offs and robust generalization. The work provides a practical toolbox for real-time MI estimation in multimodal and streaming settings, with strong implications for embodied AI and online decision-making.
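To make the table readout concrete, here is a minimal sketch of how a discretized discriminant can be turned into a Donsker-Varadhan estimate, I(X;Y) >= E_joint[theta] - log E_marginals[exp(theta)]. It is not the authors' implementation. The function names, the 16x16 resolution, the use of all n*n sample pairings for the marginal term, and the histogram-based table (which stands in for the network's predicted table) are all assumptions made for illustration.

```python
import numpy as np

def copula_normalize(x):
    """Map 1D samples to (0, 1) via empirical ranks (a rank-based copula transform)."""
    ranks = np.argsort(np.argsort(x))
    return (ranks + 0.5) / len(x)

def grid_index(u, resolution):
    """Index of the grid cell containing each value of u in (0, 1)."""
    return np.minimum((u * resolution).astype(int), resolution - 1)

def dv_estimate(u, v, table):
    """Donsker-Varadhan bound: E_joint[theta] - log E_marginals[exp(theta)]."""
    iu = grid_index(u, table.shape[0])
    iv = grid_index(v, table.shape[1])
    joint_term = table[iu, iv].mean()
    # Assumption: all n*n pairings (u_i, v_j) stand in for the product of marginals.
    marginal_term = np.log(np.exp(table[np.ix_(iu, iv)]).mean())
    return joint_term - marginal_term

def histogram_table(u, v, resolution, eps=1e-12):
    """Hypothetical stand-in for the network output: log density ratio from a 2D histogram."""
    counts = np.zeros((resolution, resolution))
    np.add.at(counts, (grid_index(u, resolution), grid_index(v, resolution)), 1)
    p = counts / counts.sum()
    px = p.sum(axis=1, keepdims=True)
    py = p.sum(axis=0, keepdims=True)
    return np.log((p + eps) / (px * py + eps))

# Correlated Gaussian pair (rho = 0.9), rank-normalized before lookup.
rng = np.random.default_rng(0)
n = 1000
x = rng.normal(size=n)
y = 0.9 * x + np.sqrt(1 - 0.9**2) * rng.normal(size=n)
u, v = copula_normalize(x), copula_normalize(y)

zero_estimate = dv_estimate(u, v, np.zeros((16, 16)))      # constant discriminant
hist_estimate = dv_estimate(u, v, histogram_table(u, v, 16))  # informative discriminant
print(zero_estimate, hist_estimate)
```

A constant table gives a bound of zero, while a table close to the log density ratio gives a positive estimate. In InfoNet the table would instead be predicted by the network from the samples in a single forward pass, which is what removes the per-distribution optimization.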

Abstract

Estimating mutual correlations between random variables or data streams is essential for intelligent behavior and decision-making. As a fundamental quantity for measuring statistical relationships, mutual information has been extensively studied and utilized for its generality and equitability. However, existing methods often lack either the efficiency needed for real-time applications (for example, estimators that require test-time optimization of a neural network) or the differentiability required for end-to-end learning (for example, histogram-based estimators). We introduce a neural network called InfoNet, which directly outputs mutual information estimates for data streams by leveraging the attention mechanism and the computational efficiency of deep learning infrastructures. By maximizing a dual formulation of mutual information through large-scale simulated training, our approach circumvents time-consuming test-time optimization and generalizes to unseen distributions. We evaluate the effectiveness and generalization of our proposed mutual information estimation scheme on various families of distributions and applications. Our results demonstrate that InfoNet and its training process provide a graceful efficiency-accuracy trade-off and order-preserving properties. We will make the code and models available as a comprehensive toolbox to facilitate studies in different fields requiring real-time mutual information estimation.

Paper Structure (30 sections, 8 equations, 16 figures, 4 tables, 1 algorithm)

Introduction
Problem Statement
Neural MI Estimation without Test-Time Optimization
Dual Estimation of MI
Optimal Discriminant Prediction
Data Generation and Training Algorithm
Experiments
Evaluation Data and Metrics
Evaluation
Setups and Metrics
Results and Comparisons
Test-Time Efficiency
Sanity Check on Gaussian
GMMs with Multiple Components
Mutual Correlation Order Accuracy
...and 15 more sections

Figures (16)

Figure 1: Log-scale run time comparison of MINE and the proposed InfoNet, which is consistently faster by orders of magnitude across sequences of varying lengths because it bypasses the costly test-time optimization.
Figure 2: A comparison of MINE and the proposed InfoNet for neural MI estimation. In the training phase, MINE optimizes an MLP's parameters (as a discriminant function) using the dual formula (Donsker and Varadhan, 1983) against a joint distribution. The optimized MLP then estimates the same distribution's MI with its samples. However, the MLP is not optimal for a new distribution and requires retraining (test-time optimization) before providing an estimate. In contrast, InfoNet is trained on various distributions to output the optimal discriminant ($\theta$) for any distribution. At test time, InfoNet predicts the optimal discriminant for a new distribution using its samples, leveraging the generalization capability from large-scale training, thus eliminating the need for test-time optimization and increasing efficiency.
Figure 3: The proposed InfoNet architecture for MI prediction comprises learnable queries and attention blocks. It accepts a sequence of samples from two random variables and outputs a look-up table (top-right) representing a discretization of the optimal scalar discriminant function defined on the joint domain in the Donsker-Varadhan representation (Donsker and Varadhan, 1983). The MI between the two random variables (sequences) can then be calculated by summation according to the Donsker-Varadhan MI equation. Note that the input sequences for training are sampled from various distributions. Please also refer to Fig. 2 for a comparison between MINE and InfoNet training schemes.
Figure 4: Comparison of MI estimates under Gaussian settings (runtime included).
Figure 5: Independence testing under three types of data correlations. Each curve in the plots depicts the area under the curve (AUC) of the receiver operating characteristic (ROC) with respect to sequence length $n$. Four MI estimators are compared: InfoNet, KSG, MINE-100, and MINE-1000 (i.e., MINE trained with 100 and 1000 gradient steps during test-time optimization), each evaluated at two data dimensions (16 and 128). The curves obtained by InfoNet (with the slicing technique) are consistently higher than the others, which demonstrates the effectiveness of InfoNet for dealing with high-dimensional data.
...and 11 more figures
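
The slicing technique mentioned in the Figure 5 caption can be illustrated with a small sketch. It assumes the common definition of sliced MI as the average of scalar MI estimates over random one-dimensional projections of the two high-dimensional variables. The plug-in histogram estimator below is only a stand-in for a scalar MI estimator such as InfoNet, and all function names, the number of slices, and the bin count are hypothetical choices rather than the paper's settings.

```python
import numpy as np

def rank_normalize(x):
    """Map 1D samples to (0, 1) via empirical ranks."""
    return (np.argsort(np.argsort(x)) + 0.5) / len(x)

def histogram_mi(x, y, bins=8):
    """Plug-in MI (in nats) between two 1D samples; a stand-in scalar estimator."""
    u, v = rank_normalize(x), rank_normalize(y)
    counts, _, _ = np.histogram2d(u, v, bins=bins, range=[[0, 1], [0, 1]])
    p = counts / counts.sum()
    px = p.sum(axis=1, keepdims=True)
    py = p.sum(axis=0, keepdims=True)
    mask = p > 0
    return float((p[mask] * np.log(p[mask] / (px @ py)[mask])).sum())

def sliced_mi(X, Y, num_slices=128, estimator=histogram_mi, seed=0):
    """Average a scalar MI estimate over random unit-norm 1D projections of X and Y."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(num_slices):
        a = rng.normal(size=X.shape[1])
        b = rng.normal(size=Y.shape[1])
        a /= np.linalg.norm(a)
        b /= np.linalg.norm(b)
        total += estimator(X @ a, Y @ b)
    return total / num_slices

# 16-dimensional example: one dependent pair and one independent pair.
rng = np.random.default_rng(1)
n, d = 2000, 16
X = rng.normal(size=(n, d))
Y_dependent = X + 0.3 * rng.normal(size=(n, d))
Y_independent = rng.normal(size=(n, d))

smi_dependent = sliced_mi(X, Y_dependent)
smi_independent = sliced_mi(X, Y_independent)
print(smi_dependent, smi_independent)
```

Thresholding the sliced score then gives an independence test whose quality can be summarized by an ROC curve. The independent pair does not score exactly zero here because the plug-in histogram estimator has a small positive bias at finite sample size.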
