A Foundational Potential Energy Surface Dataset for Materials
Aaron D. Kaplan, Runze Liu, Ji Qi, Tsz Wai Ko, Bowen Deng, Janosh Riebesell, Gerbrand Ceder, Kristin A. Persson, Shyue Ping Ong
TL;DR
This work introduces MatPES, a foundational PES dataset for materials. Its roughly 400,000 structures are selected by efficient 2DIRECT sampling from 281 million MD-derived snapshots that together span ~16 billion atomic environments, and they are labeled with both PBE and r$^2$SCAN. With far fewer structures than prior datasets, MatPES enables UMLIPs to achieve state-of-the-art performance across equilibrium, near-equilibrium, and MD benchmarks. The authors demonstrate that UMLIPs trained on MatPES rival or surpass those trained on MPRelax and OMat24 in accuracy and robustness, including improved MD stability and ionic conductivity predictions, while advancing open science through accessible data and tooling. The work emphasizes data quality over quantity and proposes future expansions to cover higher-temperature and higher-pressure regimes, defects, surfaces, and transition states. Overall, MatPES provides a scalable, community-driven foundation for reliable UMLIPs in large-scale materials discovery and design.
Abstract
Accurate potential energy surface (PES) descriptions are essential for atomistic simulations of materials. Universal machine learning interatomic potentials (UMLIPs)$^{1-3}$ offer a computationally efficient alternative to density functional theory (DFT)$^4$ for PES modeling across the periodic table. However, their accuracy today is fundamentally constrained by their reliance on DFT relaxation data.$^{5,6}$ Here, we introduce MatPES, a foundational PES dataset comprising $\sim 400{,}000$ structures carefully sampled from 281 million molecular dynamics snapshots that span 16 billion atomic environments. We demonstrate that UMLIPs trained on the modestly sized MatPES dataset can rival, or even outperform, prior models trained on much larger datasets across a broad range of equilibrium, near-equilibrium, and molecular dynamics property benchmarks. We also introduce the first high-fidelity PES dataset based on the revised regularized strongly constrained and appropriately normed (r$^2$SCAN) functional,$^7$ which offers greatly improved descriptions of interatomic bonding. The open-source MatPES initiative emphasizes the importance of data quality over quantity in materials science and enables broad community-driven advancements toward more reliable, generalizable, and efficient UMLIPs for large-scale materials discovery and design.
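To make concrete the idea of picking a small, diverse subset from a very large pool of snapshots, the following is a minimal sketch of cluster-then-stratify selection. It is not the authors' 2DIRECT procedure, whose details are not given in this summary. The PCA projection, the plain k-means clustering, the synthetic feature vectors, and all function names (`pca_project`, `kmeans`, `stratified_select`) are illustrative assumptions.

```python
import numpy as np


def pca_project(features, n_components):
    """Project standardized features onto their leading principal components."""
    centered = features - features.mean(axis=0)
    scale = centered.std(axis=0)
    scale[scale == 0] = 1.0  # avoid division by zero for constant columns
    standardized = centered / scale
    # Rows of vt are principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(standardized, full_matrices=False)
    return standardized @ vt[:n_components].T


def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)


def kmeans(points, n_clusters, rng, n_iter=50):
    """Plain Lloyd's k-means (a stand-in clustering step, assumed here)."""
    start = rng.choice(len(points), size=n_clusters, replace=False)
    centroids = points[start]  # fancy indexing returns a copy
    for _ in range(n_iter):
        labels = assign(points, centroids)
        for k in range(n_clusters):
            members = points[labels == k]
            if len(members) > 0:
                centroids[k] = members.mean(axis=0)
    return assign(points, centroids), centroids


def stratified_select(features, n_clusters, n_components=2, seed=0):
    """Return the index of the point closest to each non-empty cluster centroid."""
    rng = np.random.default_rng(seed)
    reduced = pca_project(features, n_components)
    labels, centroids = kmeans(reduced, n_clusters, rng)
    selected = []
    for k in range(n_clusters):
        members = np.flatnonzero(labels == k)
        if members.size == 0:
            continue  # an empty cluster contributes no representative
        d = np.linalg.norm(reduced[members] - centroids[k], axis=1)
        selected.append(int(members[d.argmin()]))
    return sorted(selected)


# Hypothetical stand-in for per-snapshot descriptors: 500 snapshots, 8 features.
demo_features = np.random.default_rng(42).normal(size=(500, 8))
demo_selected = stratified_select(demo_features, n_clusters=20)
```

Taking one representative per cluster keeps the subset spread across the sampled feature space rather than concentrated where snapshots are densest, which is the intuition behind favoring coverage over raw structure count.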
