Uncertainty Calibration with Energy Based Instance-wise Scaling in the Wild Dataset

Mijoo Kim; Junseok Kwon

TL;DR

This work tackles uncertainty calibration for deep neural networks under distribution shift, covering both in-distribution (ID) and out-of-distribution (OOD) data. It introduces an energy-based, instance-wise, post-hoc calibration method: a per-input scaling factor is derived from energy scores computed from the network logits, and a small set of parameters is trained so that calibration adapts to each sample. By using the free energy and its distributions over correct and incorrect predictions, the approach achieves robust calibration across ID data, covariate shift, and semantic shift, and performs better than or on par with established baselines and DAC. The method yields more reliable uncertainty estimates for safety-critical applications and is a practical, lightweight post-hoc option that generalizes across architectures and datasets. Overall, the paper presents a principled energy-based framework for making DNN predictions more trustworthy in the wild, validated empirically on multiple benchmarks.

Abstract

With the rapid advancement in the performance of deep neural networks (DNNs), there has been significant interest in deploying and incorporating artificial intelligence (AI) systems into real-world scenarios. However, many DNNs lack the ability to represent uncertainty, often exhibiting excessive confidence even when making incorrect predictions. To ensure the reliability of AI systems, particularly in safety-critical cases, DNNs should transparently reflect the uncertainty in their predictions. In this paper, we investigate robust post-hoc uncertainty calibration methods for DNNs within the context of multi-class classification tasks. While previous studies have made notable progress, they still face challenges in achieving robust calibration, particularly in scenarios involving out-of-distribution (OOD) data. We identify that previous methods lack adaptability to individual input data and struggle to accurately estimate uncertainty when processing inputs drawn from the wild dataset. To address this issue, we introduce a novel instance-wise calibration method based on an energy model. Our method uses energy scores instead of softmax confidence scores, allowing the uncertainty of the DNN to be considered adaptively for each prediction in the logit space. In experiments, we show that the proposed method maintains robust performance across the whole spectrum, from in-distribution to OOD scenarios, when compared to other state-of-the-art methods.
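To make the contrast between softmax confidence and energy scores concrete, the following is a minimal sketch, not the authors' implementation. It assumes the standard free-energy form $\mathcal{F}(\mathbf{z}) = -T \log \sum_i \exp(z_i / T)$ with temperature $T = 1$; the paper's exact definition is not reproduced in this excerpt. The function names and the example logits are hypothetical.

```python
import math


def free_energy(logits, temperature=1.0):
    """Free energy F = -T * log(sum_i exp(z_i / T)), computed stably.

    Assumption: this is the standard log-sum-exp form; T = 1 is a default
    chosen for illustration, not a value taken from the paper.
    """
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max before exponentiating to avoid overflow
    log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scaled))
    return -temperature * log_sum_exp


def max_softmax(logits):
    """Conventional confidence: the largest softmax probability."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    return max(exps) / sum(exps)


# Two hypothetical logit vectors that differ only by a constant shift.
peaked = [10.0, 0.0, 0.0]
shifted = [z - 8.0 for z in peaked]  # [2.0, -8.0, -8.0]

for name, z in [("peaked", peaked), ("shifted", shifted)]:
    # The negative free energy is the score plotted in the paper's figures.
    print(name, max_softmax(z), -free_energy(z))
```

Softmax is invariant to adding a constant to every logit, so both vectors receive the same softmax confidence, while their negative energy scores differ by exactly the shift. This is one way to see that the energy score keeps logit-magnitude information that the softmax discards.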

Paper Structure (25 sections, 14 equations, 9 figures, 9 tables, 2 algorithms)

This paper contains 25 sections, 14 equations, 9 figures, 9 tables, 2 algorithms.

Introduction
Related work
Post-hoc Calibration
Beyond In-distribution Calibration
Problem Setup
Notation
Calibration Metric
Calibration in OOD Scenarios
Proposed Method
Mathematical Motivation
Robust Instance-wise Calibration
Experiment
Experimental Settings
Ablation Study on Energy Score
Comparison with Baseline Methods
...and 10 more sections

Figures (9)

Figure 1: Conventional softmax confidence scores versus the proposed energy scores. First column: Softmax confidence score (top) and negative energy score (bottom) for correct and incorrect samples. Second column: Softmax confidence score (top) and negative energy score (bottom) for in-distribution (CIFAR10) and OOD samples (SVHN). Our energy scores exhibit greater separability between correct and incorrect predictions, as well as between in-distribution and OOD samples in DenseNet201.
Figure 2: The overall pipeline for our calibration method. An input image $\mathbf{x}$ from the wild dataset is fed into a pre-trained classifier, producing the logit $\mathbf{z}$. Subsequently, the free energy $\mathcal{F}$ (see the free-energy definition) is calculated for each logit $\mathbf{z}$. Then, two probability density functions (PDFs), i.e., $P_{correct}$ and $P_{incorrect}$, are estimated from the free energy $\mathcal{F}$ of correct and incorrect instances, respectively. These PDFs are used to adjust the scaling factors $\lambda_{1}$ and $\lambda_{2}$ (see the scaling-factor equation). With the scaling factors, the parameters $\theta_{1}$ and $\theta_{2}$ are trained by optimizing the loss function (see the training objective). Using the trained parameters, the calibrated confidence for a test image $\mathbf{x}$ is calculated by applying the scaling in an instance-wise manner (see the calibrated-confidence equation).
Figure 3: Negative energy scores and accuracy from ID to semantic OOD data. The x-axis shows the severity levels of corruptions, while the left y-axis shows the negative energy scores (box plot) and the right y-axis shows accuracy (line plot). Severity level 0 (light blue) indicates the negative energy score for ID data (CIFAR10), levels 1-5 (deepening shades of blue) represent increasing degrees of corruption (CIFAR10-C), and the negative energy score for semantic OOD data (SVHN) is depicted in purple.
Figure 4: Calibration errors across different levels of corruption severity. Our method also benefits from the calibration effect on fully ID data (severity level 0), while outperforming other approaches at the remaining severity levels. We used WideResNet40 trained on CIFAR10.
Figure 5: Expected Calibration Error (ECE) for various corruption types at severity level 5. Our proposed method outperforms the other methods across corruption types, using DenseNet201 trained on the CIFAR100 dataset.
...and 4 more figures
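
The captions above report Expected Calibration Error (ECE). As a rough illustration of what that metric measures, here is a minimal sketch of the common equal-width binning estimator: the sample-weighted average gap between accuracy and mean confidence within each confidence bin. The bin count of 10, the function name, and the toy inputs are assumptions for illustration; the paper's exact evaluation protocol is not given in this excerpt.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width bins: sum_b (|B_b| / n) * |acc(B_b) - conf(B_b)|.

    confidences: predicted confidence per sample, each in [0, 1]
    correct:     1 if the prediction was right, 0 otherwise
    n_bins:      number of bins (10 is an assumed, commonly used default)
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # Bin b covers [b / n_bins, (b + 1) / n_bins); confidence 1.0 goes to the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, hit))

    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece


# Hypothetical toy predictions: an overconfident group and a roughly calibrated one.
toy_confidences = [0.95, 0.95, 0.95, 0.95, 0.55, 0.55]
toy_correct = [1, 1, 1, 0, 1, 0]
print(expected_calibration_error(toy_confidences, toy_correct))
```

In the toy data, the four high-confidence predictions are right only 75% of the time while claiming 95% confidence, so that bin dominates the error. A post-hoc calibrator aims to shrink such gaps without changing the predicted class.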
