Matbench Discovery -- A framework to evaluate machine learning crystal stability predictions

Janosh Riebesell; Rhys E. A. Goodall; Philipp Benner; Yuan Chiang; Bowen Deng; Gerbrand Ceder; Mark Asta; Alpha A. Lee; Anubhav Jain; Kristin A. Persson

TL;DR

Matbench Discovery introduces a task-based benchmarking framework for evaluating ML energy models as accelerators of inorganic materials discovery. Models are judged on how well they predict thermodynamic stability via distance to the convex hull, rather than on formation energy alone. The framework pairs training data sourced from the Materials Project with a large WBM test set that probes out-of-domain performance, and an online leaderboard allows methods to be compared. Across 13 models, universal interatomic potentials trained on energies, forces, and stresses outperform energy-only approaches in classification, achieving high F1 scores and substantial discovery acceleration, while regression metrics can mislead about task performance near the stability boundary. The work underscores the need for prospective, task-aligned benchmarks in ML-guided materials discovery and points to future gains from larger, higher-fidelity training data and expanded criteria beyond zero-Kelvin stability.

Abstract

The rapid adoption of machine learning (ML) in domain sciences necessitates best practices and standardized benchmarking for performance evaluation. We present Matbench Discovery, an evaluation framework for ML energy models, applied as pre-filters for high-throughput searches of stable inorganic crystals. This framework addresses the disconnect between thermodynamic stability and formation energy, as well as retrospective vs. prospective benchmarking in materials discovery. We release a Python package to support model submissions and maintain an online leaderboard, offering insights into performance trade-offs. To identify the best-performing ML methodologies for materials discovery, we benchmarked various approaches, including random forests, graph neural networks (GNNs), one-shot predictors, iterative Bayesian optimizers, and universal interatomic potentials (UIP). Our initial results rank models by test set F1 scores for thermodynamic stability prediction: EquiformerV2 + DeNS > Orb > SevenNet > MACE > CHGNet > M3GNet > ALIGNN > MEGNet > CGCNN > CGCNN+P > Wrenformer > BOWSR > Voronoi fingerprint random forest. UIPs emerge as the top performers, achieving F1 scores of 0.57-0.82 and discovery acceleration factors (DAF) of up to 6x on the first 10k stable predictions compared to random selection. We also identify a misalignment between regression metrics and task-relevant classification metrics. Accurate regressors can yield high false-positive rates near the decision boundary at 0 eV/atom above the convex hull. Our results demonstrate UIPs' ability to optimize computational budget allocation for expanding materials databases. However, their limitations remain underexplored in traditional benchmarks. We advocate for task-based evaluation frameworks, as implemented here, to address these limitations and advance ML-guided materials discovery.
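
The classification metrics named in the abstract can be illustrated with a minimal sketch. It assumes that a material counts as stable when its energy above the convex hull is at or below 0 eV/atom, and that the discovery acceleration factor (DAF) is precision divided by the prevalence of stable materials in the test set. The function name `stability_metrics` and the toy energies are hypothetical and are not taken from the Matbench Discovery package.

```python
import numpy as np


def stability_metrics(e_hull_true, e_hull_pred, threshold=0.0):
    """Compare regression error with stability classification quality.

    Assumption: a material is 'stable' if its hull distance (eV/atom)
    is <= threshold. DAF is taken as precision / prevalence.
    """
    e_true = np.asarray(e_hull_true, dtype=float)
    e_pred = np.asarray(e_hull_pred, dtype=float)

    true_stable = e_true <= threshold
    pred_stable = e_pred <= threshold

    tp = int(np.sum(true_stable & pred_stable))   # stable, predicted stable
    fp = int(np.sum(~true_stable & pred_stable))  # unstable, predicted stable
    fn = int(np.sum(true_stable & ~pred_stable))  # stable, predicted unstable

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0

    prevalence = float(true_stable.mean())
    daf = precision / prevalence if prevalence else float("nan")
    mae = float(np.mean(np.abs(e_pred - e_true)))

    return {"mae": mae, "precision": precision, "recall": recall,
            "f1": f1, "daf": daf}


# Hypothetical toy data (eV/atom): every prediction is within 0.03 eV/atom
# of the truth, but several true values sit close to the 0 eV/atom boundary.
e_true = [-0.05, -0.01, 0.01, 0.02, 0.03, 0.20, 0.30, 0.40]
e_pred = [-0.04, 0.01, -0.01, -0.01, 0.04, 0.21, 0.29, 0.41]

metrics = stability_metrics(e_true, e_pred)
print(metrics)
```

In this toy set the mean absolute error is small, yet two of the three materials predicted stable are false positives because their true hull distances lie just above the threshold. This is the kind of mismatch between regression and classification metrics that the abstract describes.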

Paper Structure (22 sections, 16 figures, 3 tables)

Introduction
Evaluation Framework for Materials Discovery
Matbench Discovery
Materials Project Training Set
WBM Test Set
Limitations of this Framework
Models
Results
Discussion
Acknowledgments
Author Contributions
Code availability
Data availability
Supplementary Information
Metrics on full test set and for 10k materials predicted most stable
...and 7 more sections

Figures (16)

Figure 1: An overview of how data is used in Matbench Discovery. a) shows a conventional prototype-based discovery workflow where different elemental assignments to the sites in a known prototype are used to create a candidate structure. This candidate is relaxed using DFT to arrive at a relaxed structure that can be compared against a reference convex hull. This kind of workflow was used to construct the WBM data set. b) highlights that databases such as the Materials Project provide a rich set of data which different academic groups have used to explore different types of models. While prior work tended to focus on individual modalities, our framework enables consistent model comparisons across modalities. c) shows the proposed test evaluation framework where the end user takes a machine learning model and uses it to predict a relaxed energy given an initial structure (IS2RE). This energy is then used to predict whether the material will be stable or unstable with respect to a reference convex hull. From an applications perspective, this classification performance is better aligned with intended use cases in screening workflows.
Figure 2: This figure measures model utility for materials discovery campaigns of varying sizes by plotting precision and recall as a function of the number of model predictions validated. A typical discovery campaign will rank hypothetical materials by model-predicted hull distance from most to least stable and validate the most stable predictions first. A higher fraction of correct stable predictions corresponds to higher precision, and fewer overlooked stable materials corresponds to higher recall. Precision is calculated only from the materials selected up to that point, whilst cumulative recall depends on knowing the total number of positives upfront. This figure highlights how different models perform better or worse depending on the length of the discovery campaign. The UIPs offer significantly improved precision on shorter campaigns of $\mathord{\sim}$20k or fewer materials validated, as they are less prone to false positive predictions among highly stable materials.
Figure 3: Universal potentials are more reliable classifiers because they exit the red triangle earliest. These lines show the rolling MAE on the WBM test set as the energy to the convex hull of the MP training set is varied. Lower is better. Inside the large red 'triangle of peril', models are most likely to misclassify structures. As long as a model's rolling MAE remains inside the triangle, its mean error is larger than the distance to the convex hull. If the model's error for a given prediction happens to point towards the stability threshold at $E_\text{above MP hull} = 0$, its average error will change the stability classification from true positive or negative to false negative or positive. The width of the 'rolling window' box indicates the width over which prediction errors were averaged.
Figure S1: Receiver operating characteristic (ROC) curve for each model. The false positive rate (FPR) on the $x$ axis is the fraction of unstable structures classified as stable. The true positive rate (TPR) on the $y$ axis is the fraction of stable structures classified as stable.
Figure S2: Parity plots of model-predicted energy distance to the convex hull (based on their formation energy predictions) vs DFT ground truth, color-coded by log density of points. Models are sorted left to right and top to bottom by MAE. For parity plots of formation energy predictions, see \ref{fig:e-form-parity-models}.
...and 11 more figures
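
The quantities plotted in Figures 2 and 3 can be sketched in a few lines. This is an illustrative reading of the captions, not the authors' plotting code: `campaign_curves` ranks materials by predicted hull distance and tracks cumulative precision and recall, and `rolling_mae` averages absolute errors in a window around chosen true hull distances. The function names, the window width, and the toy energies are all hypothetical. The triangle test assumes only what the Figure 3 caption states, namely that a model is inside the triangle where its rolling MAE exceeds the distance to the hull.

```python
import numpy as np


def campaign_curves(e_hull_true, e_hull_pred, threshold=0.0):
    """Cumulative precision and recall when validating the materials
    predicted most stable first (the Figure 2 style of ranking)."""
    e_true = np.asarray(e_hull_true, dtype=float)
    e_pred = np.asarray(e_hull_pred, dtype=float)

    order = np.argsort(e_pred, kind="stable")  # most stable prediction first
    hits = np.cumsum(e_true[order] <= threshold)  # true stable found so far
    n_validated = np.arange(1, len(e_true) + 1)

    precision = hits / n_validated
    total_stable = int(np.sum(e_true <= threshold))
    recall = hits / total_stable if total_stable else np.zeros(len(e_true))
    return precision, recall


def rolling_mae(e_hull_true, e_hull_pred, centers, window=0.05):
    """Mean absolute error of predictions whose true hull distance lies
    within +/- window/2 of each center (window width is an assumption)."""
    e_true = np.asarray(e_hull_true, dtype=float)
    e_pred = np.asarray(e_hull_pred, dtype=float)
    abs_err = np.abs(e_pred - e_true)

    out = []
    for center in centers:
        mask = np.abs(e_true - center) <= window / 2
        out.append(abs_err[mask].mean() if mask.any() else np.nan)
    return np.array(out)


# Hypothetical toy data in eV/atom.
e_true = [-0.05, -0.01, 0.01, 0.02, 0.03, 0.20]
e_pred = [-0.04, 0.01, -0.01, -0.02, 0.04, 0.21]

precision, recall = campaign_curves(e_true, e_pred)

centers = np.array([0.0, 0.2])
mae_by_center = rolling_mae(e_true, e_pred, centers)
# Inside the triangle: average error is larger than the distance to the hull.
in_triangle = mae_by_center > np.abs(centers)

print(precision, recall)
print(mae_by_center, in_triangle)
```

On the toy data the rolling MAE near the hull exceeds the hull distance itself, so predictions there can flip the stability label, whereas at 0.2 eV/atom above the hull the same size of error is harmless.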
