Model-free quantification of completeness, uncertainties, and outliers in atomistic machine learning using information theory

Daniel Schwalbe-Koda; Sebastien Hamel; Babak Sadigh; Fei Zhou; Vincenzo Lordi

TL;DR

This work addresses the need for rigorous, model-free quantification of information content, uncertainty, and outliers in atomistic machine learning by introducing QUESTS, a kernel-density-based entropy framework over atom-centered descriptors. The authors formulate an atomistic information entropy $\mathcal{H}$ and a differential entropy $\delta\mathcal{H}$ to quantify dataset completeness, learning efficiency, and extrapolation risk without training any model, enabling robust UQ and outlier detection. Key contributions include linking entropy to learning curves on molecular and solid-state datasets, introducing dataset diversity $D$ for active-learning analysis, and applying $\delta\mathcal{H}$ to detect outliers in giant MD simulations and to rationalize TM23 transferability trends. The approach provides practical tools for dataset design, active-learning strategies, and real-time monitoring of ML-driven atomistic simulations, with potential to improve reliability and interpretability in materials modeling.

Abstract

An accurate description of information is relevant for a range of problems in atomistic machine learning (ML), such as crafting training sets, performing uncertainty quantification (UQ), or extracting physical insights from large datasets. However, atomistic ML often relies on unsupervised learning or model predictions to analyze information contents from simulation or training data. Here, we introduce a theoretical framework that provides a rigorous, model-free tool to quantify information contents in atomistic simulations. We demonstrate that the information entropy of a distribution of atom-centered environments explains known heuristics in ML potential developments, from training set sizes to dataset optimality. Using this tool, we propose a model-free UQ method that reliably predicts epistemic uncertainty and detects out-of-distribution samples, including rare events in systems such as nucleation. This method provides a general tool for data-driven atomistic modeling and combines efforts in ML, simulations, and physical explainability.
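
The entropy quantities summarized above can be illustrated with a minimal sketch. This is not the QUESTS implementation. The Gaussian kernel, the bandwidth `h`, the function names, and the specific formulas ($\mathcal{H} = -\frac{1}{n}\sum_i \log \frac{1}{n}\sum_j K_h(x_i, x_j)$ and $\delta\mathcal{H}(y|X) = -\log \sum_j K_h(y, x_j)$) are assumptions, chosen so that the entropy is bounded by $\log n$ and so that $\delta\mathcal{H} > 0$ flags environments outside the reference set, as the figure captions describe. Plain vectors stand in for the atom-centered descriptors.

```python
import numpy as np


def kernel_matrix(a, b, h=1.0):
    """Gaussian kernel K(a_i, b_j) = exp(-|a_i - b_j|^2 / (2 h^2)).

    The kernel form and bandwidth h are assumptions for illustration.
    """
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * h**2))


def entropy(x, h=1.0):
    """Kernel-density estimate of the information entropy (in nats).

    Equals 0 when all environments coincide and log(n) when none overlap.
    """
    k = kernel_matrix(x, x, h)
    return float(-np.mean(np.log(k.mean(axis=1))))


def delta_entropy(y, x, h=1.0):
    """Differential entropy of each test environment y_i given reference x.

    Values <= 0 mean y_i overlaps with the reference set; values > 0 mean
    it lies outside. The kernel sum is clipped to avoid log(0).
    """
    s = kernel_matrix(y, x, h).sum(axis=1)
    return -np.log(np.clip(s, np.finfo(float).tiny, None))


def novelty(y, x, h=1.0):
    """Fraction of environments in y with delta_entropy > 0 relative to x."""
    return float(np.mean(delta_entropy(y, x, h) > 0))
```

For example, calling `novelty(new_generation, all_previous_generations)` with two hypothetical descriptor arrays gives the fraction of new environments not yet covered by the earlier data, which is the quantity used to track active-learning generations.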

Paper Structure (37 sections, 39 equations, 36 figures, 1 table)

Introduction
Results
Formulation of an atomistic information entropy
Information-theoretical dataset analysis for machine learning potentials
Relating information contents to learning curves in molecular datasets
Dataset completeness in solid-state systems
Information efficiency and diversity in active learning loops
Model-free uncertainty quantification for machine learning potentials
Information theory explains chemical and error trends across the TM23 dataset
Information-based detection of outliers and rare events in atomistic simulations
Discussion
Conclusions
Supplementary Text
Derivation of the descriptor
Radial terms
...and 22 more sections

Figures (36)

Figure 1: Overview of the method. a, Typical workflow in MLIPs for training, evaluating, and retraining models that predict potential energy surfaces (PES). Challenges in the process are highlighted in magenta. b, Overview of our QUESTS method, which computes the information entropy of a non-parametric descriptor distribution.
Figure 2: Information entropy measures dataset completeness, compressibility, and sample efficiency in MLIPs. a, Information entropy for three example molecules from the rMD17 dataset as a function of the dataset size. Simpler molecules exhibit lower entropy and converge faster, while more diverse molecules require more samples to converge. b, Correlation between the error in predicted forces and the information gap for all molecules in the rMD17 dataset. The errors were obtained from the original reference for MACE. Circles indicate errors when 1000 samples are used to train the models, and crosses indicate errors when only 50 samples are used. $\rho$ is the Pearson correlation coefficient. c, Information entropy (blue bars) of selected subsets of the carbon GAP-20 dataset. The maximum entropy is given by $\log n$ (gray bars), where $n$ is the number of atomic environments. The results are sorted by ascending dataset entropy. d, Information gap obtained by compressing the "Fullerenes" and "Graphene" subsets of GAP-20 by up to 20% of their original sizes. While the information gap of "Graphene" remains close to zero, that of "Fullerenes" increases monotonically as the dataset size decreases. e, Test errors relative to the errors obtained when a MACE model is trained on the full subset of GAP-20. The results show that the "Graphene" subset can be compressed by up to 20% of its size without loss of performance, whereas this is not the case for the "Fullerenes" subset. f, Information entropy and diversity for the ANI-Al dataset computed for each generation of active learning. Oversampling of certain phases leads to a total reduction of entropy, as demonstrated by g, which shows decreasing novelty in the samples. In this approach, novelty is the fraction of environments showing $\delta \mathcal{H} > 0$ when the datasets of all previous generations are taken as reference. Nevertheless, the diversity of the dataset continues to increase.
Figure 3: Information entropy quantifies overlaps between datasets and is a model-free UQ method. a, Overlap between test and reference sets for the GAP-20 carbon dataset. Only a subset of the data is shown for clarity (see Fig. \ref{fig:si:04-gap20-dH-table} for the complete matrix). b, Test errors (in %) of a MACE model trained on one of the subsets of the GAP-20 dataset shown in a, and tested on the other subsets. Each point corresponds to a single (train, test) pair. Models with higher overlaps between train and test sets exhibit substantially smaller errors. c, Correlation between force errors (in eV/Å) and $\delta \mathcal{H}$ for models trained on the "Defects" subset of GAP-20. The average RMSE (orange line) increases with higher $\delta \mathcal{H}$. For "Fullerenes", the $\delta \mathcal{H}$ was truncated to 100 nats for clarity.
Figure 4: Information-theoretical quantities correlate with error and chemical trends in the TM23 dataset. a, Information entropy of the full TM23 training set for each element. b, Force errors (in %) for NequIP models trained on the full training set, obtained from Owen et al. c, These two quantities exhibit strong correlation, as indicated by the Pearson correlation coefficient of $\rho = 0.79$ for transition metals with an incomplete d-shell. d, The difference between the final force errors (in %) and the initial force errors (in %), denoted as $\Delta$Error, is explained by the dataset overlap obtained by computing $\delta \mathcal{H} (0.75 T_m | 0.25 T_m)$ or $\delta \mathcal{H} (1.25 T_m | 0.25 T_m)$, as demonstrated by the Pearson correlation coefficient of $\rho = -0.85$. Red and blue dots indicate error differences for models trained on the "cold" subset of TM23 (sampled at 0.25 $T_m$, where $T_m$ is the melting temperature) and tested on the "warm" (0.75 $T_m$) and "melt" (1.25 $T_m$) subsets, respectively.
Figure 5: Differential entropies detect outliers and rare events in large atomistic simulations. a, Visualization of a 32.5M-atom snapshot of BCC Ta simulated with SNAP. Colors represent the values of the estimated $\delta \mathcal{H}$, with blue atoms indicating environments reasonably within the training set ($\delta \mathcal{H} < 0$) and red atoms indicating environments outside of the training set ($\delta \mathcal{H} > 0$). Values of $\delta \mathcal{H}$ were truncated to the range $[-5, 5]$ to facilitate the visualization of divergent colors. b, Example of a high-uncertainty region encountered during the simulation. The formation of a disordered, non-BCC phase (red) in the simulation leads to unphysical behavior in the trajectory. c, The unphysical behavior cannot be identified only by errors in forces. Even outside of its known domain, the SNAP model exhibits errors within the range of systems within the training set. The number of environments in each region is shown by the color scale, with brighter colors indicating exponentially denser regions of the error-$\delta \mathcal{H}$ space.
...and 31 more figures
