Surface Stability Modeling with Universal Machine Learning Interatomic Potentials: A Comprehensive Cleavage Energy Benchmarking Study

Ardavan Mehdizadeh, Peter Schindler

TL;DR

This work asks how well universal machine learning interatomic potentials (uMLIPs) predict cleavage energies. It benchmarks 19 uMLIPs on 36,718 DFT-derived slab structures and finds that training data composition, in particular the non-equilibrium configurations in the Open Materials 2024 (OMat24) dataset, matters more for predictive accuracy than architectural sophistication. Relatively simple architectures trained on such data reach mean absolute percentage errors below 6% and identify the thermodynamically most stable surface terminations with high accuracy, without any explicit surface-energy training. The findings support a data-centric development paradigm for foundation potentials and point toward fast, reliable surface and interfacial property predictions in materials design. The study also outlines limitations and future directions, including broader chemistries, higher-index surfaces, and uncertainty quantification, to further support high-throughput surface science workflows.

Abstract

Machine learning interatomic potentials (MLIPs) have revolutionized computational materials science by bridging the gap between quantum mechanical accuracy and classical simulation efficiency, enabling unprecedented exploration of materials properties across the periodic table. Despite their remarkable success in predicting bulk properties, no systematic evaluation has assessed how well these universal MLIPs (uMLIPs) can predict cleavage energies, a critical property governing fracture, catalysis, surface stability, and interfacial phenomena. Here, we present a comprehensive benchmark of 19 state-of-the-art uMLIPs for cleavage energy prediction using our previously established density functional theory (DFT) database of 36,718 slab structures spanning elemental, binary, and ternary metallic compounds. We evaluate diverse architectural paradigms, analyzing their performance across chemical compositions, crystal systems, thickness, and surface orientations. Our results reveal that training data composition dominates architectural sophistication: models trained on the Open Materials 2024 (OMat24) dataset, which emphasizes non-equilibrium configurations, achieve mean absolute percentage errors below 6% and correctly identify the thermodynamically most stable surface terminations in 87% of cases, without any explicit surface energy training. In contrast, architecturally identical models trained on equilibrium-only datasets show five-fold higher errors, while models trained on surface-adsorbate data fail catastrophically with a 17-fold degradation. Remarkably, simpler architectures trained on appropriate data achieve comparable accuracy to complex transformers while offering 10-100x computational speedup. These findings show that the community should focus on strategic training data generation that captures the relevant physical phenomena.
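
The two error metrics quoted in this study, mean absolute percentage error (MAPE) and mean absolute error (MAE), can be illustrated with a minimal sketch. The function names and the numerical values below are hypothetical and chosen only for illustration; they are not taken from the paper's code or data.

```python
def absolute_percentage_errors(predicted, reference):
    """Per-structure absolute percentage error (APE), in percent.

    Assumes every reference (DFT) value is non-zero.
    """
    return [abs(p - r) / abs(r) * 100.0 for p, r in zip(predicted, reference)]


def mape(predicted, reference):
    """Mean absolute percentage error over all structures, in percent."""
    apes = absolute_percentage_errors(predicted, reference)
    return sum(apes) / len(apes)


def mae(predicted, reference):
    """Mean absolute error, in the same units as the inputs."""
    errors = [abs(p - r) for p, r in zip(predicted, reference)]
    return sum(errors) / len(errors)


# Hypothetical cleavage energies in meV/Å^2 (illustrative values only).
dft_energies = [100.0, 80.0, 50.0]
mlip_energies = [95.0, 84.0, 50.0]

print(f"MAPE = {mape(mlip_energies, dft_energies):.2f} %")
print(f"MAE  = {mae(mlip_energies, dft_energies):.2f} meV/Å^2")
```

Because MAPE normalizes each error by the reference value, it weights low-energy surfaces more heavily than MAE does, which is one reason to report both.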

Paper Structure

This paper contains 16 sections, 1 equation, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Overview illustration of our cleavage energy benchmarking study highlighting the DFT-based cleavage energy database, uMLIP models considered, and performance benchmarking undertaken.
  • Figure 2: Comparative performance analysis of eight selected uMLIPs for cleavage energy prediction. (a) Ridge plot of APE distributions for 36,718 surface structures, ordered by median APE. Numbers indicate the mode of each distribution. The $x$-axis is truncated at 50%. (b) Dual-axis box plots showing MAPE (solid boxes, left axis) and MAE (hatched boxes, right axis in meV/Å$^2$). (c) Stacked bar chart showing agreement with DFT for identifying thermodynamically most stable surface terminations across 3,699 materials. Green: exact termination match; yellow: correct Miller plane, wrong termination; red: complete disagreement.
  • Figure 3: Detailed performance analysis of UMA-m-1p1-OMat24 for cleavage energy prediction. (a) Hexagonal density parity plot comparing uMLIP-predicted versus DFT-calculated cleavage energies. Note that the color bar is logarithmic. (b) Probability density distribution of prediction errors (uMLIP $-$ DFT), showing a kernel density estimate (KDE) curve calculated using a Gaussian kernel with Scott's rule bandwidth selection. The mean and median error values are displayed in the text box. (c) MAPE as a function of DFT cleavage energy bins. Numbers above bars indicate the sample count in each bin.
  • Figure 4: Decomposition of UMA-m-1p1-OMat24 prediction errors by element, crystal system, and slab thickness. (a) Periodic table heat map showing MAPE for surfaces containing each element, with viridis colormap normalized to 0–30% range. (b) MAPE distribution across the seven crystal systems with sample counts indicated above bars. (c) MAPE dependency on slab thickness, binned into 5 Å intervals, showing consistent accuracy across different slab thicknesses with sample counts per bin.
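
The three-way agreement categories in Figure 2(c) can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes that the most stable termination of a material is the one with the lowest cleavage energy, and that each surface is keyed by a hypothetical `(miller_index, termination_label)` tuple. All names and values are made up for the example.

```python
from collections import Counter


def most_stable(energies):
    """Return the (miller, termination) key with the lowest cleavage energy.

    Assumption: lowest cleavage energy means most stable termination.
    """
    return min(energies, key=energies.get)


def classify_agreement(dft_energies, mlip_energies):
    """Compare the most stable surface picked by DFT and by the uMLIP."""
    dft_miller, dft_term = most_stable(dft_energies)
    mlip_miller, mlip_term = most_stable(mlip_energies)
    if dft_miller != mlip_miller:
        return "disagreement"   # red: different Miller plane
    if dft_term != mlip_term:
        return "plane_only"     # yellow: correct plane, wrong termination
    return "exact"              # green: exact termination match


# Hypothetical per-material cleavage energies (illustrative values only).
materials = {
    "A": ({((1, 0, 0), "t1"): 90.0, ((1, 1, 1), "t1"): 70.0},
          {((1, 0, 0), "t1"): 88.0, ((1, 1, 1), "t1"): 72.0}),
    "B": ({((1, 1, 0), "t1"): 60.0, ((1, 1, 0), "t2"): 65.0},
          {((1, 1, 0), "t1"): 66.0, ((1, 1, 0), "t2"): 63.0}),
    "C": ({((0, 0, 1), "t1"): 40.0, ((1, 0, 0), "t1"): 55.0},
          {((0, 0, 1), "t1"): 58.0, ((1, 0, 0), "t1"): 52.0}),
}

counts = Counter(classify_agreement(dft, mlip) for dft, mlip in materials.values())
print(dict(counts))
```

Dividing each count by the number of materials gives the fractions that a stacked bar chart of this kind would display.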