How to Train a Shallow Ensemble

Moritz Schäfer; Matthias Kellner; Johannes Kästner; Michele Ceriotti

TL;DR

This work systematically investigates training strategies for shallow ensembles to balance calibration performance with computational cost, and validates an efficient protocol: full-model fine-tuning of a shallow ensemble originally trained with a probabilistic energy loss, or one sampled from the Laplace posterior.

Abstract

Shallow ensembles provide a convenient strategy for uncertainty quantification in machine learning interatomic potentials that is computationally efficient because the ensemble members share a large part of the model weights. In this work, we systematically investigate training strategies for shallow ensembles to balance calibration performance with computational cost. We first demonstrate that explicit optimization of a negative log-likelihood (NLL) loss improves calibration relative to approaches based on ensembles of randomly initialized models or on a last-layer Laplace approximation. However, models trained solely on energy objectives yield miscalibrated force estimates. We show that explicitly modeling force uncertainties via an NLL objective is essential for reliable calibration, though it typically incurs a significant computational overhead. To address this, we validate an efficient protocol: full-model fine-tuning of a shallow ensemble originally trained with a probabilistic energy loss, or one sampled from the Laplace posterior. This approach results in a negligible reduction in calibration quality compared to training from scratch, while reducing training time by up to 96%. We evaluate this protocol across a diverse range of materials, including amorphous carbon, ionic liquids (BMIM), liquid water (H$_2$O), barium titanate (BaTiO$_3$), and a model tetrapeptide (Ac-Ala$_3$-NHMe), establishing practical guidelines for reliable uncertainty quantification in atomistic machine learning.
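
As a rough illustration of the NLL objective mentioned in the abstract, the sketch below evaluates a standard Gaussian negative log-likelihood from the mean and spread of ensemble predictions. It is a minimal sketch under stated assumptions, not the authors' implementation: the function name `ensemble_gaussian_nll`, the array layout, the use of the unbiased sample variance, and the `min_variance` floor are all hypothetical choices, and the exact loss used in the paper is not given on this page.

```python
import numpy as np


def ensemble_gaussian_nll(member_predictions, targets, min_variance=1e-12):
    """Mean Gaussian NLL of the targets under the ensemble mean and variance.

    member_predictions: array of shape (n_members, n_samples), one row per
        ensemble member (hypothetical layout, chosen for illustration).
    targets: array of shape (n_samples,) with reference values.
    min_variance: assumed floor that avoids division by zero when all
        members agree exactly.
    """
    preds = np.asarray(member_predictions, dtype=float)
    y = np.asarray(targets, dtype=float)

    # Ensemble mean is the prediction; ensemble spread is the uncertainty.
    mean = preds.mean(axis=0)
    var = np.maximum(preds.var(axis=0, ddof=1), min_variance)

    # Per-sample Gaussian NLL: 0.5 * [log(2*pi*var) + (y - mean)^2 / var]
    nll = 0.5 * (np.log(2.0 * np.pi * var) + (y - mean) ** 2 / var)
    return float(nll.mean())
```

In a training setting the same expression would be written in a differentiable framework and minimized with respect to the shared and per-member weights; the NumPy version here only shows how the loss rewards both accurate means and well-sized variances.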

Paper Structure (24 sections, 30 equations, 13 figures, 7 tables)

This paper contains 24 sections, 30 equations, 13 figures, 7 tables.

Introduction
Methods
Machine learning interatomic potentials
Uncertainty estimation
Efficient Last-Layer Approximations
Estimating Force Uncertainty
Post-hoc Calibration and Evaluation of Uncertainty Estimates
Description of Datasets
Results
Comparing Shallow Ensemble and LLPR Energy Uncertainty Estimates
Calibrated Force Uncertainties
Efficient Generation of Last-Layer Ensembles
Conclusions and Outlook
Data Availability
Model Architectures
...and 9 more sections

Figures (13)

Figure 1: Predicted-empirical error plots of various energy uncertainty estimation approaches for the BMIM dataset. Panel a) compares SE$_{E}$ and LLPR$_{E}$; panel b) compares last-layer and full-model fine-tuning for ensembles sampled from the LLPR$_{E}$ posterior.
Figure 2: Predicted-empirical error parity plots for force components in the reshuffled BMIM test set. Points are colored by element type to visualize species-dependent calibration. Panels (a) and (b) show energy-only calibrated models (LLPR$_{E}$ and SE$_{E}$), which exhibit systematic miscalibration for the boron and fluorine atoms (the anion). Panels (c) and (d) show the corresponding force-informed models (LLPR$_{E,F}$ and SE$_{E,F}$).
Figure 3: Normalized eigenvalue spectra of the per-element force loss Hessian with respect to the last-layer weights for a) the MSE-trained model used for LLPR$_{E,F}$ and b) the NLL-trained SE$_{E,F}$.
Figure 4: Relative log-likelihoods (RLLs) of different last-layer ensemble initialization and fine-tuning strategies compared to training shallow ensembles from scratch. a) Relative energy log-likelihoods across 5 datasets. b) Relative force log-likelihoods. Both are averaged over 3 random seeds. Negative RLLs are only displayed down to -100 for readability. Panels c) and d) show the SE$_{E}$ fine-tuning results for BMIM and BaTiO$_3$ with highlighted outliers.
Figure 5: Training time savings in percent for SE$_{E}$ training followed by fine-tuning, compared to training with a force NLL loss from scratch, across the 5 benchmark datasets. Panel a) shows the savings for GMNN with full-model fine-tuning (red) and panel b) shows the savings for EquivMP with last-layer fine-tuning (orange). The base model training time fractions are indicated in blue.
...and 8 more figures
