On Uncertainty Quantification for Near-Bayes Optimal Algorithms

Ziyu Wang, Chris Holmes

TL;DR

The authors prove that, for an ML algorithm that is near Bayes-optimal w.r.t. an unknown task distribution, building a martingale posterior from the algorithm can recover the Bayesian posterior defined by that task distribution, which is optimal in this setting.

Abstract

Bayesian modelling allows for the quantification of predictive uncertainty, which is crucial in safety-critical applications. Yet for many machine learning (ML) algorithms, it is difficult to construct or implement their Bayesian counterpart. In this work we present a promising approach to address this challenge, based on the hypothesis that commonly used ML algorithms are efficient across a wide variety of tasks and may thus be near Bayes-optimal w.r.t. an unknown task distribution. We prove that it is possible to recover the Bayesian posterior defined by the task distribution, which is unknown but optimal in this setting, by building a martingale posterior using the algorithm. We further propose a practical uncertainty quantification method that applies to general ML algorithms. Experiments based on a variety of non-NN and NN algorithms demonstrate the efficacy of our method.
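
To make the idea of "building a martingale posterior using the algorithm" concrete, the following is a minimal sketch of generic predictive resampling on a toy problem. It is not the paper's practical method. The toy "algorithm" (a running-mean estimator for a Gaussian with known noise scale), the function name `martingale_posterior_mean`, and all parameter names are assumptions made for illustration only. Starting from the estimate fitted on the observed data, the sketch repeatedly imputes a future observation from the current plug-in predictive, updates the estimate online, and treats the spread of the final estimates across independent runs as the uncertainty.

```python
import numpy as np


def martingale_posterior_mean(y, sigma=1.0, n_future=2000, n_samples=2000, seed=0):
    """Predictive resampling for a running-mean estimator (illustrative toy).

    y         : observed data, 1-D array of length n
    sigma     : assumed known noise scale of the plug-in predictive
    n_future  : number of imputed future observations per run (N)
    n_samples : number of independent resampling runs
    """
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    n = y.size
    # Every run starts from the estimate fitted on the observed data.
    theta = np.full(n_samples, y.mean())
    for i in range(n, n + n_future):
        # Impute the next observation from the current plug-in predictive.
        y_new = rng.normal(theta, sigma)
        # Online update of the running mean after seeing observation i + 1.
        # The conditional expectation of the update is zero, so the sequence
        # of estimates is a martingale.
        theta = theta + (y_new - theta) / (i + 1)
    return theta


if __name__ == "__main__":
    data = np.random.default_rng(1).normal(0.3, 1.0, size=20)
    draws = martingale_posterior_mean(data)
    print("mean of draws:", draws.mean())
    print("std of draws :", draws.std())
```

In this toy case the variance of the draws is `sigma**2` times the sum of `1/k**2` for `k` from `n + 1` to `n + N`, which is roughly `sigma**2 / n` for large `N`, the same scale as the Bayesian posterior for a Gaussian mean under a flat prior.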

Paper Structure (61 sections, 2 theorems, 51 equations, 4 figures, 9 tables, 1 algorithm)

This paper contains 61 sections, 2 theorems, 51 equations, 4 figures, 9 tables, 1 algorithm.

Key Result

Theorem 3.1

Let $\pi_{n}, \hat{p}_{mp,n}$ be defined as above, and $W_{2,\theta}$ be the 2-Wasserstein distance w.r.t. $\|\cdot\|$. Under Asm. \ref{asm:approx-martingale}-\ref{asm:conventions}, there exists some $C>0$ determined by $(C_\Theta,C_{\mathcal{A}},C_{\mathcal{A}}',L_1,L_2)$ s.t. for $\chi_n = C/(sn^s) \to 0$ we have [...] Consequently, if $N\gg n$ is sufficiently large so that $\bar{\varepsilon}_{B,N}\ll \bar{\varepsilo

Figures (4)

  • Figure 1: GP inference on the Snelson dataset: visualisation of the approximate MP defined by Eq. \ref{eq:gp-alg-spo}, compared with the ensemble predictors defined by a modified MAP estimator with similar initialisation randomness (Eq. \ref{eq:map-anchoring}). Solid line and shade indicate the mean estimate and $80\%$ pointwise credible intervals (CIs) for the true regression function. Dashed line indicates the $80\%$ CIs from the exact posterior. Dots at bottom indicate the location of training inputs.
  • Figure 2: Multi-task learning simulation: results with varying choices of $(m,n_{pret},n_{test})$. Plotted are the mean and 95% confidence interval (CI) for each metric. CIs are computed on 160 replications using normal approximation (first two subplots) or the Wilson score (last subplot).
  • Figure 3: Classification experiment: scatter plot of the test metrics (for each dataset averaged over 10 random splits; higher is better) for the base algorithm vs the proposed method.
  • Figure 4: Classification experiment: approximate MP for the GBDT feature importance scores and their pairwise correlations. Plotted are the top 5 features in the UCI Adult dataset.
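
The Figure 2 caption mentions two ways of computing 95% confidence intervals over 160 replications: a normal approximation and the Wilson score. The sketch below shows the standard textbook form of both intervals. The function names and the example counts are hypothetical, and the assumption that the Wilson interval is applied to a proportion-type metric (such as a coverage rate) is ours, not something the caption states.

```python
import math
from statistics import NormalDist


def normal_ci(values, level=0.95):
    """Normal-approximation CI for the mean of replicated metric values."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)  # sample variance
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half = z * math.sqrt(var / n)
    return mean - half, mean + half


def wilson_ci(successes, n, level=0.95):
    """Wilson score CI for a binomial proportion successes / n."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


if __name__ == "__main__":
    # Hypothetical example: 150 successes out of 160 replications.
    print(wilson_ci(150, 160))
    print(normal_ci([0.1 * k for k in range(10)]))
```

Unlike the normal approximation, the Wilson interval stays inside [0, 1] and remains informative when the observed proportion is close to 0 or 1, which is a common reason to prefer it for rate-type metrics.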

Theorems & Definitions (13)

  • Remark 2.1: supervised learning
  • Remark 2.2: identifiability and semi-norm
  • Theorem 3.1: proof in App. \ref{app:proof-thm-main}
  • Remark 3.1
  • Example A.1: comparison to nonparametric bootstrap
  • Example A.2: comparison to parametric bootstrap
  • Corollary B.1
  • proof
  • Claim B.1
  • proof
  • ...and 3 more