Is Epistemic Uncertainty Faithfully Represented by Evidential Deep Learning Methods?

Mira Jürgens, Nis Meinert, Viktor Bengs, Eyke Hüllermeier, Willem Waegeman

TL;DR

This paper presents novel theoretical insights into evidential deep learning, highlighting the difficulties of optimizing second-order loss functions and of interpreting the resulting epistemic uncertainty measures, including their relative (rather than absolute) nature.

Abstract

Trustworthy ML systems should not only return accurate predictions, but also a reliable representation of their uncertainty. Bayesian methods are commonly used to quantify both aleatoric and epistemic uncertainty, but alternative approaches, such as evidential deep learning methods, have become popular in recent years. The latter group of methods in essence extends empirical risk minimization (ERM) to predict second-order probability distributions over outcomes, from which measures of epistemic (and aleatoric) uncertainty can be extracted. This paper presents novel theoretical insights into evidential deep learning, highlighting the difficulties in optimizing second-order loss functions and interpreting the resulting epistemic uncertainty measures. With a systematic setup that covers a wide range of approaches for classification, regression and counts, it provides novel insights into issues of identifiability and convergence in second-order loss minimization, and the relative (rather than absolute) nature of epistemic uncertainty measures.
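
To make the phrase "second-order loss minimization" concrete, a generic objective of this kind can be sketched as follows. The notation ($Q_\phi$, $L_1$, $R$, $\lambda$) is chosen here for illustration and is an assumption, not necessarily the paper's exact formulation:

$$
\min_{\phi} \; \frac{1}{N} \sum_{i=1}^{N} \Big( \mathbb{E}_{\theta \sim Q_\phi(\cdot \mid x_i)} \big[ L_1(\theta, y_i) \big] + \lambda \, R\big(Q_\phi(\cdot \mid x_i)\big) \Big)
$$

Here $Q_\phi(\cdot \mid x)$ denotes the predicted second-order distribution over the first-order parameter $\theta$, $L_1$ is a first-order loss such as the negative log-likelihood, and $R$ is a regularizer weighted by $\lambda \geq 0$. Objectives of this form appear in several evidential deep learning methods, typically with a divergence to a fixed prior as regularizer.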

Paper Structure

This paper contains 22 sections, 4 theorems, 47 equations, 6 figures, and 2 tables.

Key Result

Theorem 3.2

Let $\mathcal{C}_1$ and $\mathcal{C}_J$ be the co-domains of $\mathcal{H}_1$ and $\mathcal{H}_J$. The following properties hold when $\mathcal{H}_2$ consists of

Figures (6)

  • Figure 1: General overview of a neural network architecture for first-order (left) and second-order (right) risk minimization. While the first-order model learns the Bernoulli parameter $\theta$ of the data generating distribution directly, the second-order model predicts the parameters $\boldsymbol{m}=(\alpha, \beta)^\top$ of a Beta distribution, which defines a distribution with density $f(\theta|x, \alpha, \beta) = \frac{1}{B(\alpha, \beta)}\theta^{\alpha -1}(1 - \theta)^{\beta -1}$ over $\theta$. The example is aligned with our experimental setup, where the feature space is one-dimensional.
  • Figure 2: Binary classification experiments for training sample size $N \in \{100, 500, 1000\}$. The true $\theta$ as a function of $x$ is shown in green. The mean estimated $\theta$ for the reference distribution and the second-order models is given in blue. Confidence bounds of the reference distribution (visualized in grey) are obtained by resampling the training data $100$ times. Confidence bounds for the models trained via second-order risk minimization (visualized in purple) are obtained by the confidence intervals of the learned Beta distribution defined by its $2.5\%$ and $97.5\%$ quantile. The black dots denote one training dataset. See main text for experimental setup.
  • Figure 3: Empirical Wasserstein-$1$ distance between the reference and the estimated second-order distributions for $\lambda \in \{0.0, 0.001, 0.01, 0.05, 0.1, 0.5\}$. The distance is calculated at $100$ equidistant points in $\mathcal{X}= [0,1]$ by evaluating the empirical and second-order distributions in the parameter space and calculating the average $L_1$ distance.
  • Figure 4: Behavior of the average value of the predicted parameters $\widehat{\alpha}$ and $\widehat{\beta}$, obtained by the second-order learners as a function of the number of training epochs. To obtain the mean and the confidence bounds, $40$ models were trained on the same $N=1000$ instances with different random weight initializations. The average is first taken over the instance space and individual runs are plotted. In addition, the average over different runs is also shown.
  • Figure 5: Regression experiments for training sample size $N \in \{100, 500, 1000\}$. The reference model learns the parameters $\theta=(\mu, \sigma)$ of the underlying normal distribution. The true underlying mean $\mu = x^3$ and variance $\sigma^2=9$ are visualised in separate rows together with the mean predictions (in blue) and the obtained confidence intervals. Confidence bounds for the parameters predicted by the reference model are obtained by resampling the training data $100$ times. Confidence bounds for $\mu$ and $\sigma^2$ learned by second-order risk minimization (visualized in purple) are obtained from the quantiles of the normal distribution $\mathcal{N}(\widehat{\mu}, \frac{\widehat{\beta}}{(\widehat{\alpha} -1)\widehat{\nu}})$ and of the Inverse-Gamma distribution $\Gamma^{-1}(\widehat{\alpha}, \widehat{\beta})$, respectively. The red dots denote one training dataset. See main text for experimental setup.
  • ...and 1 more figure
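
The captions of Figures 2 and 3 rely on two computations: the central $95\%$ interval of a predicted Beta distribution (its $2.5\%$ and $97.5\%$ quantiles) and an empirical Wasserstein-$1$ distance between two distributions over $\theta$. The following is a minimal sketch of both using only the Python standard library. The function names, the Monte Carlo approximation of the quantiles, and the equal-sample-size shortcut for the distance are assumptions made for illustration and are not taken from the paper's code.

```python
import random


def beta_interval(alpha, beta, level=0.95, n_samples=20000, seed=0):
    """Monte Carlo estimate of the central `level` interval of Beta(alpha, beta).

    Hypothetical helper: draws samples and reads off empirical quantiles
    instead of inverting the Beta CDF exactly.
    """
    rng = random.Random(seed)
    draws = sorted(rng.betavariate(alpha, beta) for _ in range(n_samples))
    lo_idx = int((1 - level) / 2 * (n_samples - 1))
    hi_idx = int((1 + level) / 2 * (n_samples - 1))
    return draws[lo_idx], draws[hi_idx]


def wasserstein1(xs, ys):
    """Empirical Wasserstein-1 distance between two equal-size 1-D samples.

    For equal sample sizes this equals the mean absolute difference
    between the sorted samples (the empirical quantile functions).
    """
    if len(xs) != len(ys):
        raise ValueError("samples must have the same size")
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


if __name__ == "__main__":
    # Illustrative values only: a predicted Beta(2, 5) at some input x.
    lower, upper = beta_interval(2.0, 5.0)

    # Stand-ins for a reference distribution and a predicted one over theta.
    rng = random.Random(1)
    reference = [rng.betavariate(20.0, 50.0) for _ in range(1000)]
    predicted = [rng.betavariate(2.0, 5.0) for _ in range(1000)]

    print("95% interval:", lower, upper)
    print("W1 distance:", wasserstein1(reference, predicted))
```

In this sketch the two Beta distributions share the same mean but differ in spread, so the distance reflects a mismatch in the represented epistemic uncertainty rather than in the point prediction.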

Theorems & Definitions

  • Definition 3.1
  • Theorem 3.2
  • Theorem 3.3
  • Theorem 3.4