Uncertainty Quantification Metrics for Deep Regression

Simon Kristoffersson Lind, Ziliang Xiong, Per-Erik Forssén, Volker Krüger

TL;DR

This work evaluates four uncertainty quantification metrics for regression tasks: Area Under Sparsification Error, Calibration Error, Spearman's Rank Correlation, and Negative Log-Likelihood. It finds that Calibration Error is the most stable and interpretable of the four.

Abstract

When deploying deep neural networks on robots or other physical systems, the learned model should reliably quantify predictive uncertainty. Reliable uncertainty estimates allow downstream modules to reason about the safety of their actions. In this work, we address metrics for evaluating such uncertainty. Specifically, we focus on regression tasks and investigate Area Under Sparsification Error (AUSE), Calibration Error, Spearman's Rank Correlation, and Negative Log-Likelihood (NLL). Using synthetic regression datasets, we examine how these metrics behave under four typical types of uncertainty and how stable they are with respect to the size of the test set, and we reveal their strengths and weaknesses. Our results indicate that Calibration Error is the most stable and interpretable metric, but AUSE and NLL also have their respective use cases. We discourage the use of Spearman's Rank Correlation for evaluating uncertainties and recommend replacing it with AUSE.
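
To make three of the metrics named in the abstract concrete, here is a minimal Python sketch. It rests on assumptions that the abstract does not state: the model outputs a Gaussian mean and standard deviation per sample, Calibration Error is taken as the mean squared gap between nominal quantile levels and observed frequencies (one common formulation), and Spearman's Rank Correlation is computed between absolute errors and predicted standard deviations. The function names are hypothetical, and the paper's exact definitions may differ.

```python
import math
import random
from statistics import NormalDist


def gaussian_nll(y, mu, sigma):
    """Mean negative log-likelihood of targets under N(mu_i, sigma_i^2)."""
    total = 0.0
    for yi, mi, si in zip(y, mu, sigma):
        total += 0.5 * math.log(2 * math.pi * si ** 2) + (yi - mi) ** 2 / (2 * si ** 2)
    return total / len(y)


def calibration_error(y, mu, sigma, levels=None):
    """Mean squared gap between nominal levels and observed frequencies.

    Assumed formulation: for each level p, count how often the target falls
    below the predicted p-quantile, and compare that frequency with p.
    """
    if levels is None:
        levels = [j / 10 for j in range(1, 10)]
    # Predicted CDF evaluated at each target (probability integral transform).
    pit = [NormalDist(mi, si).cdf(yi) for yi, mi, si in zip(y, mu, sigma)]
    gaps = [(sum(v <= p for v in pit) / len(pit) - p) ** 2 for p in levels]
    return sum(gaps) / len(gaps)


def _ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(a, b):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    ra, rb = _ranks(a), _ranks(b)
    ma, mb = sum(ra) / len(ra), sum(rb) / len(rb)
    cov = sum((x - ma) * (z - mb) for x, z in zip(ra, rb))
    va = sum((x - ma) ** 2 for x in ra)
    vb = sum((z - mb) ** 2 for z in rb)
    return cov / math.sqrt(va * vb)


# Toy heteroscedastic data (illustrative only, not the paper's datasets).
rng = random.Random(0)
mu = [rng.uniform(-1.0, 1.0) for _ in range(1000)]
sigma = [rng.uniform(0.5, 2.0) for _ in range(1000)]
y = [rng.gauss(m, s) for m, s in zip(mu, sigma)]
abs_err = [abs(yi - mi) for yi, mi in zip(y, mu)]

print("NLL:", gaussian_nll(y, mu, sigma))
print("Calibration Error:", calibration_error(y, mu, sigma))
print("Spearman(|error|, sigma):", spearman(abs_err, sigma))
```

In this sketch, lower NLL and lower Calibration Error indicate better uncertainty estimates, while a higher Spearman value only says that the ranking of predicted uncertainties follows the ranking of errors.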

Paper Structure (20 sections, 13 equations, 5 figures, 5 tables)

Figures (5)

  • Figure 1: An illustration of UQ metrics and regression metrics. Note: the axes of CE and AUSE are distinct, but not orthogonal.
  • Figure 2: The four synthetic regression datasets. Data points are orange, and the solid blue lines represent the expectation of the generating function.
  • Figure 3: Visualization of the predicted density on the test set for trained models. Contour plots: Log-likelihood output from each model. Yellow: the high-density region; Blue: the low-density region. Blue points: Predicted mean. Orange points: Test set.
  • Figure 4: Experiments to test two types of stability of metrics under different test dataset sizes.
  • Figure 5: Sparsification plot from Deep Ensemble and True Distribution for the homoscedastic and heteroscedastic datasets. $\alpha$ is the fraction of removed samples.
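
The sparsification plot in Figure 5 can be sketched as follows. This is a minimal illustration under assumptions not confirmed by the captions: the per-sample error is the absolute error, the oracle curve ranks samples by their true error, and AUSE is the unnormalized trapezoidal area between the model curve and the oracle curve. The function names and the `steps` parameter are hypothetical.

```python
def sparsification_curve(errors, scores, fractions):
    """Mean error of the samples left after removing the fraction alpha
    with the highest scores, for each alpha in `fractions`."""
    order = sorted(range(len(errors)), key=lambda i: scores[i], reverse=True)
    sorted_err = [errors[i] for i in order]
    n = len(errors)
    curve = []
    for alpha in fractions:
        kept = sorted_err[int(alpha * n):]  # drop the top-alpha fraction
        curve.append(sum(kept) / len(kept))
    return curve


def ause(errors, uncertainties, steps=20):
    """Area between the model's sparsification curve and the oracle curve."""
    # alpha stays below 1 so that at least one sample is always kept.
    fractions = [k / steps for k in range(steps)]
    model = sparsification_curve(errors, uncertainties, fractions)
    oracle = sparsification_curve(errors, errors, fractions)
    gaps = [m - o for m, o in zip(model, oracle)]
    # Trapezoidal rule over the evaluated alpha grid.
    width = 1 / steps
    return sum((gaps[k] + gaps[k + 1]) / 2 * width for k in range(len(gaps) - 1))


# Toy absolute errors (illustrative only).
errors = [0.1, 0.5, 0.2, 0.9, 0.3, 0.7, 0.05, 0.4, 0.6, 0.8]
perfect = list(errors)               # uncertainty ranks samples exactly like the error
inverted = [-e for e in errors]      # uncertainty ranks samples in the opposite order

print("AUSE, perfect ranking:", ause(errors, perfect))
print("AUSE, inverted ranking:", ause(errors, inverted))
```

Because the oracle removes the largest errors first, its curve is the lowest achievable at every $\alpha$, so the gap is never negative and an AUSE of zero corresponds to a perfect ranking.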