Model-free quantification of completeness, uncertainties, and outliers in atomistic machine learning using information theory
Daniel Schwalbe-Koda, Sebastien Hamel, Babak Sadigh, Fei Zhou, Vincenzo Lordi
TL;DR
This work addresses the need for rigorous, model-free quantification of information content, uncertainty, and outliers in atomistic machine learning by introducing QUESTS, a kernel-density-based entropy framework over atom-centered descriptors. The authors formulate an atomistic information entropy $\mathcal{H}$ and a differential entropy $\delta\mathcal{H}$ to quantify dataset completeness, learning efficiency, and extrapolation risk without training any model, enabling robust UQ and outlier detection. Key contributions include linking entropy to learning curves on molecular and solid-state datasets, introducing dataset diversity $D$ for active-learning analysis, and applying $\delta\mathcal{H}$ to detect outliers in giant MD simulations and to rationalize TM23 transferability trends. The approach provides practical tools for dataset design, active-learning strategies, and real-time monitoring of ML-driven atomistic simulations, with potential to improve reliability and interpretability in materials modeling.
Abstract
An accurate description of information is relevant for a range of problems in atomistic machine learning (ML), such as crafting training sets, performing uncertainty quantification (UQ), or extracting physical insights from large datasets. However, atomistic ML often relies on unsupervised learning or model predictions to analyze information contents from simulation or training data. Here, we introduce a theoretical framework that provides a rigorous, model-free tool to quantify information contents in atomistic simulations. We demonstrate that the information entropy of a distribution of atom-centered environments explains known heuristics in ML potential developments, from training set sizes to dataset optimality. Using this tool, we propose a model-free UQ method that reliably predicts epistemic uncertainty and detects out-of-distribution samples, including rare events in systems such as nucleation. This method provides a general tool for data-driven atomistic modeling and combines efforts in ML, simulations, and physical explainability.
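To make the idea of a kernel-density-based entropy concrete, the following is a minimal sketch rather than the authors' implementation. This section does not state the formulas, so the sketch assumes a Gaussian kernel with a hypothetical bandwidth `h`, natural logarithms, and the forms $\mathcal{H} = -\frac{1}{n}\sum_i \log\big[\frac{1}{n}\sum_j K_h(x_i, x_j)\big]$ and $\delta\mathcal{H}(y) = -\log\big[\sum_i K_h(y, x_i)\big]$. Plain NumPy arrays stand in for atom-centered descriptors, and the function names are illustrative only.

```python
import numpy as np


def gaussian_kernel(a, b, h):
    """Pairwise Gaussian kernel between rows of a (m, d) and b (n, d)."""
    d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2.0 * h ** 2))


def dataset_entropy(x, h=1.0):
    """Assumed form of the dataset entropy H (in nats) from a kernel density estimate."""
    k = gaussian_kernel(x, x, h)
    # Inner mean: estimated density at each point; outer mean: average over points.
    return float(-np.mean(np.log(np.mean(k, axis=1))))


def differential_entropy(y, x, h=1.0):
    """Assumed form of dH for test points y relative to a reference set x.

    Values near or below zero suggest y overlaps with x; large positive
    values suggest y lies far from every reference point.
    """
    k = gaussian_kernel(y, x, h)
    return -np.log(np.sum(k, axis=1))


# Hypothetical descriptors: four identical environments vs. four distinct ones.
identical = np.zeros((4, 3))
distinct = np.arange(4.0)[:, None] * 100.0

h_identical = dataset_entropy(identical)  # redundant data carries no new information
h_distinct = dataset_entropy(distinct)    # approaches log(n) for well-separated points

reference = np.array([[0.0]])
queries = np.array([[0.0], [5.0]])
dh = differential_entropy(queries, reference)
```

Under these assumptions, a dataset of identical environments has zero entropy, a dataset of $n$ well-separated environments has entropy close to $\log n$, and a query far from the reference set receives a large positive $\delta\mathcal{H}$. That behavior is what allows such a quantity to flag out-of-distribution samples without training a model.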
