Uncertainty Quantification Metrics for Deep Regression

Simon Kristoffersson Lind; Ziliang Xiong; Per-Erik Forssén; Volker Krüger

TL;DR

This work compares four uncertainty quantification metrics for regression tasks: Area Under Sparsification Error, Calibration Error, Spearman's Rank Correlation, and Negative Log-Likelihood. It finds that Calibration Error is the most stable and interpretable metric.

Abstract

When deploying deep neural networks on robots or other physical systems, the learned model should reliably quantify its predictive uncertainty. A reliable uncertainty estimate allows downstream modules to reason about the safety of their actions. In this work, we address metrics for evaluating such uncertainty estimates. Specifically, we focus on regression tasks and investigate Area Under Sparsification Error (AUSE), Calibration Error, Spearman's Rank Correlation, and Negative Log-Likelihood (NLL). Using synthetic regression datasets, we examine how these metrics behave under four typical types of uncertainty and how stable they are with respect to the size of the test set, and we identify their strengths and weaknesses. Our results indicate that Calibration Error is the most stable and interpretable metric, but AUSE and NLL also have their respective use cases. We discourage the use of Spearman's Rank Correlation for evaluating uncertainties and recommend replacing it with AUSE.
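To make two of the metrics named above concrete, here is a minimal sketch of NLL and a quantile-based Calibration Error. It rests on assumptions that this page does not state: the model is assumed to output a Gaussian mean and standard deviation per sample, and calibration is measured as the mean absolute gap between nominal and empirical coverage of the predictive CDF. The function names `gaussian_nll` and `calibration_error` and the parameter `n_levels` are hypothetical, and the paper's exact definitions may differ.

```python
import math
import numpy as np

def gaussian_nll(y, mu, sigma):
    """Mean negative log-likelihood of y under N(mu, sigma^2) (assumed Gaussian output)."""
    y, mu, sigma = (np.asarray(a, dtype=float) for a in (y, mu, sigma))
    return float(np.mean(0.5 * np.log(2.0 * np.pi * sigma**2)
                         + 0.5 * ((y - mu) / sigma) ** 2))

def calibration_error(y, mu, sigma, n_levels=20):
    """Mean absolute gap between nominal and empirical coverage of the predictive CDF."""
    y, mu, sigma = (np.asarray(a, dtype=float) for a in (y, mu, sigma))
    z = (y - mu) / sigma
    # Predictive CDF evaluated at each target (standard normal CDF via erf).
    cdf_at_target = 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))
    levels = np.linspace(0.05, 0.95, n_levels)
    # Fraction of targets whose CDF value falls below each nominal level.
    empirical = np.array([(cdf_at_target <= p).mean() for p in levels])
    return float(np.mean(np.abs(empirical - levels)))

# Usage on synthetic data: targets drawn from N(0, 1).
rng = np.random.default_rng(0)
y = rng.normal(0.0, 1.0, size=20_000)
mu = np.zeros_like(y)
print("NLL, correct sigma:      ", gaussian_nll(y, mu, np.ones_like(y)))
print("CE, correct sigma:       ", calibration_error(y, mu, np.ones_like(y)))
print("CE, overestimated sigma: ", calibration_error(y, mu, 3.0 * np.ones_like(y)))
```

Under these assumptions, a well-calibrated model gives a Calibration Error near zero, while an over- or under-confident standard deviation moves the empirical coverage away from the nominal levels and increases it.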

Paper Structure (20 sections, 13 equations, 5 figures, 5 tables)

Introduction
Related Work
Theory
Different types of Uncertainty
Uncertainty Evaluation Metrics
Regression Models with Uncertainty Predictions
Deep Ensemble (DE)
Energy Based Regression (EBR)
Experiments and Results
Synthetic Regression Datasets
Implementation Details
Stability on varying test set sizes
Metrics under different types of uncertainty
Metrics for Real-world Applications
Discussion
...and 5 more sections

Figures (5)

Figure 1: An illustration of UQ metrics and regression metrics. Note: the axes of CE and AUSE are distinct, but not orthogonal.
Figure 2: The four synthetic regression datasets. Data points are orange, and the solid blue lines represent the expectation of the generating function.
Figure 3: Visualization of the predicted density on the test set for trained models. Contour plots: Log-likelihood output from each model. Yellow: the high-density region; Blue: the low-density region. Blue points: Predicted mean. Orange points: Test set.
Figure 4: Experiments to test two types of stability of metrics under different test dataset sizes.
Figure 5: Sparsification plot from Deep Ensemble and True Distribution for the homoscedastic and heteroscedastic datasets. $\alpha$ is the fraction of removed samples.
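The sparsification plot in Figure 5 can be sketched as follows. This is an illustrative reading rather than the paper's implementation: it assumes the curve is the mean error of the samples that remain after removing the fraction $\alpha$ with the highest predicted uncertainty, and that AUSE is the area between this curve and an oracle curve that removes samples by their true error. The names `sparsification_curve`, `ause`, and `n_steps` are hypothetical, and no normalization of the curves is applied.

```python
import numpy as np

def sparsification_curve(errors, ranking, n_steps=20):
    """Mean remaining error after removing the fraction alpha with the highest ranking score."""
    errors = np.asarray(errors, dtype=float)
    order = np.argsort(-np.asarray(ranking, dtype=float))  # highest score removed first
    sorted_err = errors[order]
    n = len(errors)
    alphas = np.linspace(0.0, 1.0, n_steps, endpoint=False)  # alpha < 1 keeps at least one sample
    curve = np.array([sorted_err[int(a * n):].mean() for a in alphas])
    return alphas, curve

def ause(errors, uncertainty, n_steps=20):
    """Area between the model's sparsification curve and the oracle curve."""
    alphas, model_curve = sparsification_curve(errors, uncertainty, n_steps)
    _, oracle_curve = sparsification_curve(errors, errors, n_steps)  # oracle ranks by true error
    gap = model_curve - oracle_curve
    # Trapezoidal rule over alpha.
    return float(np.sum(0.5 * (gap[1:] + gap[:-1]) * np.diff(alphas)))

# Usage: heteroscedastic noise whose scale is known, so sigma is an informative ranking.
rng = np.random.default_rng(0)
sigma = rng.uniform(0.1, 2.0, size=5_000)
errors = np.abs(rng.normal(0.0, sigma))
print("AUSE, informative uncertainty:", ause(errors, sigma))
print("AUSE, random uncertainty:     ", ause(errors, rng.random(5_000)))
```

In this sketch only the ordering of the uncertainties matters, not their scale, so AUSE measures ranking quality rather than calibration.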
