When Active Learning Fails, Uncalibrated Out of Distribution Uncertainty Quantification Might Be the Problem

Ashley S. Dale, Kangming Li, Brian DeCost, Hao Wan, Yuchen Han, Yao Fehlis, Jason Hattrick-Simpers

TL;DR

Analysis of the target, in-distribution uncertainty, out-of-distribution uncertainty, and training residual distributions suggests that future work should focus on understanding empirical uncertainties in the feature input space for cases where ensemble prediction variances do not accurately capture the missing information required for the model to generalize.

Abstract

Efficiently and meaningfully estimating prediction uncertainty is important for exploration in active learning campaigns in materials discovery, where samples with high uncertainty are interpreted as containing information missing from the model. In this work, the effect of different uncertainty estimation and calibration methods is evaluated for active learning when using ensembles of ALIGNN, eXtreme Gradient Boosting, Random Forest, and Neural Network model architectures. We compare uncertainty estimates from ALIGNN deep ensembles to loss landscape uncertainty estimates obtained for solubility, bandgap, and formation energy prediction tasks. We then evaluate how the quality of the uncertainty estimate impacts an active learning campaign that seeks model generalization to out-of-distribution data. Uncertainty calibration methods were found to generalize inconsistently from in-domain data to out-of-domain data. Furthermore, calibrated uncertainties were generally unsuccessful in reducing the amount of data required by a model to improve during an active learning campaign on out-of-distribution data when compared to random sampling and uncalibrated uncertainties. The impact of poor-quality uncertainty persists for Random Forest and eXtreme Gradient Boosting models trained on the same data for the same tasks, indicating that the problem is at least partially intrinsic to the data and not due to model capacity alone. Analysis of the target, in-distribution uncertainty, out-of-distribution uncertainty, and training residual distributions suggests that future work should focus on understanding empirical uncertainties in the feature input space for cases where ensemble prediction variances do not accurately capture the missing information required for the model to generalize.
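
The acquisition strategies compared in the abstract (uncertainty-driven selection versus random sampling) can be illustrated with a minimal sketch. It assumes the uncertainty of a sample is the standard deviation of the ensemble members' predictions; the function names, the array layout, and the batch-size parameter are hypothetical choices made for illustration and are not taken from the paper.

```python
import numpy as np


def ensemble_uncertainty(member_predictions):
    """Per-sample mean and standard deviation across ensemble members.

    member_predictions: array-like of shape (n_members, n_samples).
    The sample standard deviation (ddof=1) is an assumed convention.
    """
    preds = np.asarray(member_predictions, dtype=float)
    return preds.mean(axis=0), preds.std(axis=0, ddof=1)


def select_by_uncertainty(sigma, batch_size):
    """Indices of the batch_size candidates with the largest uncertainty."""
    sigma = np.asarray(sigma, dtype=float)
    order = np.argsort(-sigma, kind="stable")  # descending uncertainty
    return order[:batch_size]


def select_at_random(n_candidates, batch_size, rng):
    """Random-sampling baseline: batch_size distinct candidate indices."""
    return rng.choice(n_candidates, size=batch_size, replace=False)
```

In an active learning round, the selected indices would be labeled, moved into the training set, and the ensemble retrained; the abstract's finding is that the uncertainty-driven choice did not reliably beat the random baseline on out-of-distribution data.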

Paper Structure

This paper contains 36 sections, 7 equations, 67 figures, 4 tables.

Figures (67)

  • Figure 1: Example use of a loss landscape to generate uncertainty predictions. After training, a model's weights $\theta$ are perturbed along orthogonal directions $\alpha, \beta$ in the weight space. The perturbed model's performance is quantified using the model loss, forming the loss landscape. Models with similar performance in the loss landscape are retained, and the mean $\mu_\theta$ and standard deviation $\sigma_\theta$ of each parameter across them are used to create an independent Gaussian approximation of the posterior distribution over model parameters $\mathcal{N} \left( \mu_\theta, \sigma_\theta \right)$. Additional models are sampled from this distribution and used to generate a distribution of predictions with uncertainty $\sigma_{Y}$.
  • Figure 2: Loss landscapes. Each column represents a different data study: column 1 (a, f, k) is the solubility prediction task, column 2 (b, g, l) is the formation energy $E_f$ task with fluorine (F) omitted from the training data, column 3 (c, h, m) is the formation energy $E_f$ task with iron (Fe) omitted from the training data, column 4 (d, i, n) is the bandgap $E_g$ prediction task with F omitted from the training data, and column 5 (e, j, o) is the bandgap $E_g$ prediction task with Fe omitted from the training data. Row 1: loss landscapes from training data. Row 2: loss landscapes from ID test data. Row 3: loss landscapes from OOD data. The colorbar shows the log loss for each task, while the x-axis and y-axis are two directions in the weight space of the original model.
  • Figure 3: Comparison of uncertainty and predictions from LL-ensemble (a-j) and random-ensemble (k-t) for the formation energy $E_f$ task when F-containing compounds are omitted from the training data. Column 1 shows the prediction residuals on the in-distribution (blue) and out-of-distribution (orange) data. Column 2 shows the uncertainty estimates for the loss landscape (b, g) and ensemble (l, q) methods; the original (OG) uncertainty distribution is shown in red, the neural network (NN) calibration method in green, and the calibration factor (CF) method in blue. Columns 3-5 show the uncertainty distributions of column 2 as error bars on parity plots.
  • Figure 4: Comparison of neural network (NN) and calibration factor (CF) calibration methods to the original (OG) uncertainty distributions for a model trained to predict formation energy $E_f$ with F-containing compounds omitted. (a) ID-test uncertainties from LL-ensemble. (b) OOD uncertainties from LL-ensemble. (c) ID uncertainties from random ensemble. (d) OOD uncertainties from random ensemble. (e) Calibration errors from LL uncertainty estimates in (a) and (b). (f) Miscalibration area from LL uncertainty estimates in (a) and (b). (g) Calibration errors from random ensemble uncertainty estimates in (c) and (d). (h) Miscalibration area from random ensemble uncertainty estimates in (c) and (d).
  • Figure 5: Comparison of neural network (NN) and calibration factor (CF) calibration methods to the original (OG) uncertainty distributions for a model trained to predict formation energy $E_f$ with Fe-containing compounds omitted. (a) ID-test uncertainties from LL-ensemble. (b) OOD uncertainties from LL-ensemble. (c) ID uncertainties from random ensemble. (d) OOD uncertainties from random ensemble. (e) Calibration errors from LL uncertainty estimates in (a) and (b). (f) Miscalibration area from LL uncertainty estimates in (a) and (b). (g) Calibration errors from random ensemble uncertainty estimates in (c) and (d). (h) Miscalibration area from random ensemble uncertainty estimates in (c) and (d).
  • ...and 62 more figures
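
The procedure described in the Figure 1 caption can be sketched on a toy problem. The example below is only an illustration under stated assumptions: it swaps the paper's neural network for a linear least-squares model, and the grid half-width, the relative loss tolerance `tol` used to decide which perturbed models count as having "similar performance", and the number of sampled models are all hypothetical parameters, not values from the paper. It requires at least two model parameters so that two orthogonal directions exist.

```python
import numpy as np


def loss_landscape_uncertainty(X, y, X_new, half_width=0.5, points=21,
                               tol=0.10, n_samples=200, seed=0):
    """Toy loss-landscape ensemble for a linear model (illustrative only)."""
    rng = np.random.default_rng(seed)

    # "Trained" weights theta: ordinary least squares stands in for training.
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)

    def mse(w):
        return float(np.mean((X @ w - y) ** 2))

    # Two orthonormal directions alpha, beta in weight space (via QR).
    q, _ = np.linalg.qr(rng.normal(size=(theta.size, 2)))
    alpha, beta = q[:, 0], q[:, 1]

    # Loss landscape: loss of every perturbed model on a 2D grid.
    grid = np.linspace(-half_width, half_width, points)
    candidates = np.array([theta + a * alpha + b * beta
                           for a in grid for b in grid])
    losses = np.array([mse(w) for w in candidates])

    # Retain models whose loss is within tol (relative) of the best loss.
    keep = candidates[losses <= losses.min() * (1.0 + tol) + 1e-12]

    # Independent Gaussian over parameters from the retained models.
    mu_theta = keep.mean(axis=0)
    sigma_theta = keep.std(axis=0)

    # Sample additional models and collect their predictions.
    sampled = rng.normal(mu_theta, sigma_theta,
                         size=(n_samples, theta.size))
    preds = sampled @ np.asarray(X_new).T  # shape (n_samples, n_new)

    # Prediction mean and uncertainty sigma_Y for each new sample.
    return preds.mean(axis=0), preds.std(axis=0)
```

Because the retained models only vary along the two chosen directions, the resulting $\sigma_{Y}$ reflects a low-dimensional slice of weight space; how well such an estimate transfers to out-of-distribution inputs is the question the paper examines.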