Measuring Stochastic Data Complexity with Boltzmann Influence Functions

Nathan Ng; Roger Grosse; Marzyeh Ghassemi

TL;DR

This paper addresses uncertainty quantification under distribution shift by framing predictive uncertainty through Minimum Description Length (MDL), specifically the predictive normalized maximum likelihood (pNML) distribution. It introduces IF-COMP, a scalable approximation that uses temperature-scaled Boltzmann influence functions (BIFs) to linearize the model and estimate hindsight-optimal outputs and stochastic data complexity for both labeled and unlabeled data. IF-COMP produces calibrated predictions and complexity estimates that correlate well with ground truth while running substantially faster than existing pNML-based approaches, and it matches or beats strong baselines on uncertainty calibration, mislabel detection, and OOD detection. The results suggest that MDL-based uncertainty estimation is practical for deep networks.

Abstract

Estimating the uncertainty of a model's prediction on a test point is a crucial part of ensuring reliability and calibration under distribution shifts. A minimum description length approach to this problem uses the predictive normalized maximum likelihood (pNML) distribution, which considers every possible label for a data point, and decreases confidence in a prediction if other labels are also consistent with the model and training data. In this work we propose IF-COMP, a scalable and efficient approximation of the pNML distribution that linearizes the model with a temperature-scaled Boltzmann influence function. IF-COMP can be used to produce well-calibrated predictions on test points as well as measure complexity in both labelled and unlabelled settings. We experimentally validate IF-COMP on uncertainty calibration, mislabel detection, and OOD detection tasks, where it consistently matches or beats strong baseline methods.
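
As a rough illustration of the pNML idea described above, the sketch below shows only the final normalization step. It assumes the hindsight-optimal probabilities are already available: entry k is the probability assigned to label k by a model refit on the training data plus the test point labelled k. The function name `pnml_from_hindsight` and the example numbers are hypothetical, and this is neither the paper's code nor the IF-COMP approximation, which estimates these quantities without refitting.

```python
import numpy as np

def pnml_from_hindsight(hindsight_probs):
    """Normalize hindsight-optimal probabilities into a pNML distribution.

    hindsight_probs[k] is assumed to be the probability that a model refit
    with the test point labelled k assigns to label k (hypothetical input).
    """
    q = np.asarray(hindsight_probs, dtype=float)
    # The normalizer is at least 1 if each refit does not lower the
    # probability of its own label relative to the original model.
    normalizer = q.sum()
    pnml = q / normalizer
    # Log of the normalizer: the parametric complexity term, in nats.
    complexity = np.log(normalizer)
    return pnml, complexity

# Two labels: both can be fit fairly well in hindsight, so confidence drops.
pnml, complexity = pnml_from_hindsight([0.9, 0.6])
```

When several labels can each be fit well in hindsight, the normalizer grows, the pNML probabilities flatten, and the complexity term increases, which is the sense in which confidence decreases when other labels are also consistent with the model and training data.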

Paper Structure (31 sections, 24 equations, 7 figures, 3 tables)

Introduction
Background and Preliminaries
Minimum Description Length and Stochastic Complexity
The Infinity Problem and Proximal Bregman Objective
Influence Functions
IF-COMP: Measuring Complexity with Boltzmann Influence Functions
Boltzmann Influence Functions
IF-COMP
Efficiently Computing IF-COMP
pNML Validation
Experiments
Uncertainty Calibration
Mislabel Detection
Analyzing the Components of IF-COMP
OOD Detection
...and 16 more sections

Figures (7)

Figure 1: Pearson R correlation of different methods of approximating hindsight-optimal outputs with ground truth parametric complexity on in-domain (CIFAR-10) and out-of-domain datasets. IF-COMP achieves the highest correlation across all datasets, beating ACNML, a computationally more expensive alternative.
Figure 2: Reliability diagrams for Pixelate corruptions on CIFAR-10C. IF-COMP outperforms ACNML as well as Bayesian methods and ensembles even as corruptions increase in severity. Although IF-COMP and ACNML perform similarly on lower confidence examples, IF-COMP maintains this reliability on higher confidence examples. Dotted lines represent perfect calibration.
Figure 3: Expected calibration error (ECE) for various methods across increasing levels of CIFAR-10C corruptions. We plot medians and inter-quartile ranges. IF-COMP achieves lower ECE across almost all corruption levels compared to both Bayesian methods and other NML-based methods.
Figure 4: IF-COMP accurately trades off between log error and parametric complexity, maintaining strong AUROC throughout training. Tuning the temperature is critical to achieving accurate complexity estimates near convergence.
Figure 5: Data pruning results. Shaded regions correspond to standard deviations over 5 seeds. IF-COMP performs similarly to other methods that require access to additional checkpoints, including Trac-IN, GraNd, and EL2N. At the highest pruning levels for CIFAR-100, IF-COMP and Self-IF outperform baselines that perform worse than random.
...and 2 more figures
