High-quality, high-information datasets for universal atomistic machine learning

Cesare Malosso; Filippo Bigi; Paolo Pegolo; Joseph W. Abbott; Philip Loche; Mariana Rossi; Michele Ceriotti; Arslan Mazitov

TL;DR

The high accuracy that can be achieved with the proposed dataset is demonstrated by training PET-MAD-1.5, a generally applicable r$^2$SCAN interatomic potential that covers 102 elements in the periodic table and achieves exceptional levels of benchmark accuracy and stability in challenging simulation protocols.

Abstract

The quality, consistency, and information content of training data often determine the practical value of machine-learning models for atomistic simulations. Yet many widely used electronic-structure databases are assembled with materials screening as the primary goal rather than robust force-field learning, are limited in scope to a specific class of chemical compounds, and/or employ inconsistent DFT functionals and settings. Here we introduce MAD-1.5, a highly curated dataset designed explicitly for training broadly applicable atomistic models across the periodic table at high levels of theory. MAD-1.5 extends the MAD dataset with targeted enrichment strategies that broaden the coverage of chemical space to 102 elements while keeping the total number of configurations compact. All structures are computed with a single, standardized all-electron DFT workflow using the r$^2$SCAN meta-GGA functional and consistent convergence settings, ensuring uniformity across chemically heterogeneous systems. The dataset encompasses molecules, clusters, bulk crystals, surfaces, and low-dimensional structures, and its quality and consistency are further enhanced by outlier removal using uncertainty quantification. We demonstrate the high accuracy that can be achieved with the proposed dataset by training PET-MAD-1.5, a generally applicable r$^2$SCAN interatomic potential that covers 102 elements in the periodic table and achieves exceptional levels of benchmark accuracy and stability in challenging simulation protocols.

Paper Structure (13 sections, 1 equation, 4 figures, 5 tables)

Introduction
Dataset construction
Composition
Electronic-structure details
Outlier detection
Models and benchmarks
Model architecture and training
Uncertainty quantification
Benchmarking
Mendeleev clusters
Conclusions
Data record and model availability
Author contributions

Figures (4)

Figure 1: Periodic table indicating the statistical representation of the elements present in the MAD-1.5 dataset. Each element tile contains the proton number, element symbol, and elemental frequency in MAD-1.5, by which it is also colored. Elements in gray are not present in the dataset, and those bordered in red have been newly introduced in MAD-1.5 compared to the original MAD-1 dataset.
Figure 2: Visualization of the dataset cleaning based on predicted LLPR uncertainties. Structures whose actual absolute energy error is more than 3 times the predicted energy uncertainty are filtered out. The 8 244 structures filtered from the pre-cleaned version of the dataset are marked with red circles. The rest of the dataset is represented with gray dots.
Figure 3: Inference time of PET-MAD universal models evaluated over different kinds of materials and varying system sizes. The results of the XS- and S-size PET-MAD-1.5 models are shown with blue and orange lines, respectively, where circle and square markers denote the ASE [ase] and LAMMPS (with Kokkos backend) [lammpskokkos] implementations. The original PET-MAD-1 timings, with both the model weights and code implementations from Ref. [mazitov2025pet], are shown with gray lines. The PET-MAD-1 timings with the updated code implementation associated with the PET-MAD-1.5 release are shown in black. All performance tests were done on a single NVIDIA H100 GPU. Conservative heads were used for all the predictions.
Figure 4: (top) Time series of the potential energy for the re-ordered temperature replicas for a REMD simulation of a Mendeleev cluster containing 102 elements. Insets show the initial and final structures of the 300 K replica, and the final structure of the 3000 K replica. Isolated atoms at the end of the simulation are typically either noble gases or elements with a high vapor pressure (such as Hg). (bottom) Parity plot of the force components for the final structures in the simulation, color-coded based on the target temperature, comparing the values obtained with PET-MAD-1.5-S with those computed with single-point DFT calculations. The typical force MAE is on the order of 150 meV/Å.
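
The filtering rule described in the Figure 2 caption can be illustrated with a minimal sketch. It assumes that reference energies, model predictions, and predicted uncertainties are available as plain lists in the same units. The function name `flag_outliers`, the `threshold` parameter, and the toy numbers are hypothetical and are not taken from the paper or its code, and the actual workflow may differ in detail (for instance, in whether energies are normalized per atom).

```python
def flag_outliers(reference, predicted, uncertainty, threshold=3.0):
    """Return indices of structures whose absolute energy error exceeds
    `threshold` times the predicted uncertainty.

    Hypothetical helper: name and signature are illustrative only.
    """
    flagged = []
    for i, (e_ref, e_pred, sigma) in enumerate(
        zip(reference, predicted, uncertainty)
    ):
        # Actual absolute error of the model on this structure.
        abs_error = abs(e_pred - e_ref)
        # Flag the structure if the error is far larger than the model
        # itself expected, which suggests a problematic reference label.
        if abs_error > threshold * sigma:
            flagged.append(i)
    return flagged


# Toy values (made up for illustration, arbitrary energy units).
reference = [0.00, 1.00, 2.00, 3.00]
predicted = [0.01, 1.50, 2.02, 3.10]
uncertainty = [0.02, 0.10, 0.01, 0.05]

flagged = flag_outliers(reference, predicted, uncertainty)
print(flagged)
```

In this toy example only the second structure is flagged: its error of 0.5 is larger than three times its predicted uncertainty of 0.1, while the other errors stay within the threshold.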
