P-DRUM: Post-hoc Descriptor-based Residual Uncertainty Modeling for Machine Learning Potentials
Shih-Peng Huang, Nontawat Charoenphakdee, Yuta Tsuboi, Yong-Bin Zhuang, Wenwen Li
TL;DR
The study tackles uncertainty quantification for machine learning interatomic potentials by introducing P-DRUM, a post-hoc framework that derives residual uncertainty from the descriptors of a trained GNN potential. P-DRUM trains atom-wise residual models to estimate energy residuals \Delta E and force residuals \Delta F using either norm-based or deviation-based targets, then aggregates over atoms to yield structure-level uncertainty proxies. Across multiple datasets (HME21, rMD17, Ni3Al), P-DRUM variants align strongly with actual prediction errors, sometimes surpassing baselines such as kNN and GMM in in-domain uncertainty correlation, while OOD detection reveals tradeoffs between the norm-based and deviation-based (diff) formulations. The work demonstrates a practical, efficient alternative to ensembles for UQ in MLIPs and points to future directions in active learning and data-efficient residual training to extend applicability and reliability.
Abstract
Ensemble methods are considered the gold standard for uncertainty quantification (UQ) in machine learning interatomic potentials (MLIPs). However, their high computational cost can limit their practicality. Alternative techniques, such as Monte Carlo dropout and deep kernel learning, have been proposed to improve computational efficiency; however, some of these methods cannot be applied to already trained models and may affect prediction accuracy. In this paper, we propose a simple and efficient post-hoc framework for UQ that leverages the descriptor of a trained graph neural network potential to estimate residual errors. We refer to this method as post-hoc descriptor-based residual uncertainty modeling (P-DRUM). P-DRUM models the discrepancy between MLIP predictions and ground truth values, allowing these residuals to act as proxies for prediction uncertainty. We explore multiple variants of P-DRUM and benchmark them against established UQ methods, evaluating both their effectiveness and limitations.
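To make the residual-modeling idea concrete, the following is a minimal illustrative sketch, not the implementation used in this work. It assumes frozen per-atom descriptors are available as a NumPy array, uses a closed-form ridge regressor as a stand-in residual model, fits it to a norm-based force-residual target, and averages the atom-wise predictions into a structure-level score. The synthetic data, the helper names (`fit_ridge`, `predict_ridge`, `structure_uncertainty`), the choice of regressor, and the mean aggregation are all assumptions made for illustration.

```python
import numpy as np

def fit_ridge(X, y, lam=1e-3):
    """Closed-form ridge regression with a bias column.

    The bias weight is regularized too, purely for simplicity.
    """
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    A = Xb.T @ Xb + lam * np.eye(Xb.shape[1])
    return np.linalg.solve(A, Xb.T @ y)

def predict_ridge(w, X):
    """Predict atom-wise residual magnitudes from descriptors."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return Xb @ w

def structure_uncertainty(w, descriptors):
    """Aggregate atom-wise predictions into one score (mean of magnitudes)."""
    return float(np.mean(np.abs(predict_ridge(w, descriptors))))

# Synthetic stand-ins (hypothetical data, not from any dataset in the paper).
rng = np.random.default_rng(0)
n_atoms, n_feat = 200, 8
desc = rng.normal(size=(n_atoms, n_feat))   # frozen GNN descriptors (assumed)
f_ref = rng.normal(size=(n_atoms, 3))       # reference forces (assumed)
# Synthetic MLIP error whose scale depends on the first descriptor component.
noise_scale = 0.05 + 0.1 * np.abs(desc[:, :1])
f_pred = f_ref + noise_scale * rng.normal(size=(n_atoms, 3))

# Norm-based target: per-atom magnitude of the force residual.
target = np.linalg.norm(f_pred - f_ref, axis=1)

# Post-hoc step: the potential is untouched; only the residual model is fit.
w = fit_ridge(desc, target)

# Structure-level uncertainty proxy for a hypothetical 10-atom structure.
score = structure_uncertainty(w, desc[:10])
```

Because the residual model is fit after the fact on fixed descriptors, the trained potential and its predictions are left unchanged; only the small regressor is trained.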
