A Foundational Potential Energy Surface Dataset for Materials

Aaron D. Kaplan, Runze Liu, Ji Qi, Tsz Wai Ko, Bowen Deng, Janosh Riebesell, Gerbrand Ceder, Kristin A. Persson, Shyue Ping Ong

TL;DR

This work introduces MatPES, a foundational PES dataset for materials comprising ~400,000 structures sampled from 281 million MD snapshots that span ~16 billion atomic environments. By providing both PBE and r$^2$SCAN labeled data and employing efficient 2DIRECT sampling, MatPES enables UMLIPs to achieve state-of-the-art performance across equilibrium, near-equilibrium, and MD benchmarks with far fewer structures than prior datasets. The authors demonstrate that UMLIPs trained on MatPES can rival or surpass those trained on MPRelax and OMat24 in accuracy and robustness, including improved MD stability and ionic conductivity predictions, while advancing open science through accessible data and tooling. The work emphasizes data quality over quantity and proposes future expansions to cover higher-temperature and higher-pressure regimes, defects, surfaces, and transition states. Overall, MatPES provides a scalable, community-driven foundation for reliable UMLIPs in large-scale materials discovery and design.

Abstract

Accurate potential energy surface (PES) descriptions are essential for atomistic simulations of materials. Universal machine learning interatomic potentials (UMLIPs)$^{1-3}$ offer a computationally efficient alternative to density functional theory (DFT)$^4$ for PES modeling across the periodic table. However, their accuracy today is fundamentally constrained due to a reliance on DFT relaxation data.$^{5,6}$ Here, we introduce MatPES, a foundational PES dataset comprising $\sim 400,000$ structures carefully sampled from 281 million molecular dynamics snapshots that span 16 billion atomic environments. We demonstrate that UMLIPs trained on the modestly sized MatPES dataset can rival, or even outperform, prior models trained on much larger datasets across a broad range of equilibrium, near-equilibrium, and molecular dynamics property benchmarks. We also introduce the first high-fidelity PES dataset based on the revised regularized strongly constrained and appropriately normed (r$^2$SCAN) functional$^7$ with greatly improved descriptions of interatomic bonding. The open source MatPES initiative emphasizes the importance of data quality over quantity in materials science and enables broad community-driven advancements toward more reliable, generalizable, and efficient UMLIPs for large-scale materials discovery and design.

Paper Structure

This paper contains 14 sections, 6 equations, 10 figures, 6 tables.

Figures (10)

  • Figure 1: MatPES dataset development workflow. The number of structures at each stage in the workflow is indicated. A comprehensive configuration space was generated by performing NpT MD simulations at 300 K and 1 atm on 281,572 ground-state structures and supercells obtained from the Materials Project (v2022.10.28) [jain2013mp] using a pre-trained M3GNet UMLIP (version MP-2021.2.8-DIRECT). A 2-stage DImensionality-Reduced Encoded Clusters with sTratified (2DIRECT) sampling [qiRobustTrainingMachine2024] was then used to extract representative structures from a configuration space of $\sim$281 million structures with $\sim$16 billion atomic environments. In each cluster, the structure with the smallest number of atoms was selected to minimize the computational burden. The MD dataset was then augmented with ground-state structures with $< 100$ atoms per cell from the Materials Project to ensure coverage of equilibrium local environments. Single-point DFT calculations with stringent energy and force convergence parameters were then performed on all 504,811 structures. The periodic table heatmap indicates the number of structures containing each element and is colored on a logarithmic scale. The MatPES r$^2$SCAN dataset has a similar elemental distribution (Fig. \ref{fig:data_comp_r2scan}).
  • Figure 2: Coverage of the MatPES PBE dataset. Distribution of PBE a, cohesive energies ($E_\mathrm{coh}$) and b, interatomic force magnitudes ($|\mathbf{F}_i|$) in the MatPES (blue), MPtrj (orange) [deng2023chgnet], and OMat24 (yellow) [barrosoluque2024omat] datasets. The composition of the datasets is as follows: MatPES PBE: 434,712 structures (326,635 MD snapshots, 108,077 MP equilibrium structures); MPtrj: 1,580,361 structures from MP relaxations; OMat24: 1,077,382 structures. The MPtrj and OMat24 datasets contain a mixture of PBE and PBE$+U$ data, whereas MatPES PBE contains only PBE data.
  • Figure 3: Evaluation of UMLIPs on equilibrium properties. Distribution of the a, structural similarity fingerprint distance and b, formation energy per atom error between UMLIP and DFT-relaxed structures with the PBE and r$^2$SCAN functionals. A random direction perturbation was applied to all sites of 1,000 out-of-domain PBE-relaxed and r$^2$SCAN-relaxed structures randomly sampled from the WBM [wangpredicting2021] and GNoME [merchant2023gnome] databases, respectively, prior to geometry optimization using UMLIPs. CrystalNN [zimmerman2020crystalnn] was used to compute the fingerprint distance (see Methods).
  • Figure 4: Evaluation of UMLIPs on near-equilibrium properties. Distribution of the percentage errors in the predicted a, bulk moduli ($K_{VRH}$), b, shear moduli ($G_{VRH}$), c, constant-volume heat capacities ($C_V$), and d, off-equilibrium forces ($|\mathbf{F}_i|$) of MatPES PBE, MPRelax and OMat24 UMLIPs compared to the DFT ground truth. The elastic moduli benchmark comprises 3,959 binary compounds with computed elastic moduli in the Materials Project [jain2013mp, dejong2013elastic]. The $C_V$ benchmark is derived within the harmonic approximation using 1,170 structures from the Alexandria phonon database [loewphonon2024]. The $|\mathbf{F}_i|$ benchmark is computed from all 979 configurations in the WBM high energy states database [wangpredicting2021].
  • Figure 5: Evaluation of UMLIPs on molecular dynamics (MD) properties of the MVL-Batt test set of 172 Li- and Na-containing battery materials. a, Distributions of the MD termination steps of UMLIPs based on a controlled heating protocol from 300 K to 2,100 K at 1 bar over 50 ps with a 1 fs time step for the MVL-Batt test set. Simulations terminate due to volume explosion ($V_t \geq 1.5V_0$) or atom loss. Three runs were performed per model for statistical reliability. Only the M3GNet and TensorNet architectures were used for these simulations. The metric to assess MD stability is the median termination temperature $T_{1/2}^{term}$, indicated for each of the UMLIPs in the legend. b, Parity plots of the UMLIP-predicted ($\sigma_{\mathrm{MLIP}}$) against the AIMD ($\sigma_{\mathrm{DFT}}$) Li/Na ionic conductivities of the MVL-Batt test set. A total of 698 NVT MD simulations at multiple temperatures (300-2,100 K) were performed. The data points for six well-known Li and Na solid electrolyte materials at 1,000 K are labeled for reference. The $R^2$ score is calculated from the mean squared error in $\log(\sigma)$ to ensure a robust evaluation across multiple orders of magnitude.
  • ...and 5 more figures
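
The sampling idea in the Figure 1 caption (reduce feature dimensionality, cluster, then keep the structure with the fewest atoms in each cluster) can be illustrated with a minimal single-stage sketch. This is not the authors' implementation: the feature matrix is synthetic, plain k-means is swapped in for the clustering step, and the function names (`pca_reduce`, `kmeans_labels`, `select_representatives`) and all parameter values are hypothetical. How the two stages of 2DIRECT are composed is not stated in the caption, so only one stage is shown.

```python
import numpy as np


def pca_reduce(features, n_components):
    """Project row-wise features onto their top principal components."""
    centered = features - features.mean(axis=0)
    # Rows of vt are the principal axes, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


def kmeans_labels(points, n_clusters, n_iter=50, seed=0):
    """Plain Lloyd's k-means; an assumed stand-in for the clustering step."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=n_clusters, replace=False)]
    for _ in range(n_iter):
        # Distance from every point to every center, shape (n_points, n_clusters).
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for k in range(n_clusters):
            members = points[labels == k]
            if len(members):
                centers[k] = members.mean(axis=0)
    return labels


def select_representatives(features, n_atoms, n_components=2, n_clusters=3):
    """Return one index per non-empty cluster: the structure with the fewest atoms."""
    reduced = pca_reduce(features, n_components)
    labels = kmeans_labels(reduced, n_clusters)
    picks = []
    for k in np.unique(labels):
        members = np.flatnonzero(labels == k)
        picks.append(int(members[np.argmin(n_atoms[members])]))
    return sorted(picks)


# Synthetic demo data (hypothetical): 60 "structures" with 5 features each.
rng = np.random.default_rng(42)
features = np.vstack(
    [rng.normal(loc=c, scale=0.1, size=(20, 5)) for c in (0.0, 5.0, 10.0)]
)
n_atoms = rng.integers(2, 100, size=len(features))
picks = select_representatives(features, n_atoms)
```

Choosing the smallest cell in each cluster keeps the later single-point DFT calculations cheap while still covering every cluster of the reduced feature space.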
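
The MD stability metric in the Figure 5 caption can be sketched as follows. The caption gives the temperature range, duration, time step, and termination criteria; the linear shape of the heating ramp is an assumption here, as are the function names and the treatment of a run that survives to the end as terminating at the final step.

```python
from statistics import median


def ramp_temperature(step, t_start=300.0, t_end=2100.0,
                     duration_ps=50.0, timestep_fs=1.0):
    """Temperature (K) at a given MD step, assuming a linear heating ramp."""
    total_steps = int(round(duration_ps * 1000.0 / timestep_fs))  # 50,000 steps
    frac = min(max(step / total_steps, 0.0), 1.0)
    return t_start + frac * (t_end - t_start)


def has_terminated(volume, initial_volume, n_atoms, initial_n_atoms,
                   max_ratio=1.5):
    """True on volume explosion (V_t >= 1.5 V_0) or atom loss."""
    return volume >= max_ratio * initial_volume or n_atoms < initial_n_atoms


def median_termination_temperature(termination_steps, **ramp_kwargs):
    """Median over runs of the ramp temperature at each termination step."""
    return median(ramp_temperature(s, **ramp_kwargs) for s in termination_steps)


# Hypothetical termination steps for three runs; the last run never failed
# and is counted at the final step (an assumption, not stated in the caption).
example_steps = [10_000, 25_000, 50_000]
t_half = median_termination_temperature(example_steps)
```

Under the linear-ramp assumption, 50 ps at 1 fs per step is 50,000 steps, so the temperature rises by 1,800 K / 50,000 = 0.036 K per step.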
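
The log-space $R^2$ in the Figure 5 caption can be written as a short sketch, assuming the standard coefficient of determination, $R^2 = 1 - \mathrm{MSE} / \mathrm{Var}(\log \sigma_{\mathrm{DFT}})$. The function name `log_r2` and the example conductivities are hypothetical. The base of the logarithm does not affect the result, because changing the base rescales the numerator and denominator by the same factor.

```python
import numpy as np


def log_r2(sigma_ref, sigma_pred):
    """R^2 between reference and predicted conductivities, computed in log space."""
    y = np.log10(np.asarray(sigma_ref, dtype=float))
    y_hat = np.log10(np.asarray(sigma_pred, dtype=float))
    mse = np.mean((y - y_hat) ** 2)
    # np.var is the population variance, i.e. the MSE of predicting mean(y).
    return 1.0 - mse / np.var(y)


# Hypothetical conductivities spanning four orders of magnitude.
sigma_ref = [1e-4, 1e-2, 1.0]
r2_perfect = log_r2(sigma_ref, sigma_ref)
# Predicting the geometric mean everywhere is the log-space baseline.
r2_baseline = log_r2(sigma_ref, [1e-2, 1e-2, 1e-2])
```

Working in log space means a prediction that is off by a factor of two contributes the same error at $10^{-4}$ as at $1$, which is why the metric stays meaningful across several orders of magnitude.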