Li-P-S Electrolyte Materials as a Benchmark for Machine-Learned Interatomic Potentials

Natascia L. Fragapane, Volker L. Deringer

Abstract

With the growing availability of machine-learned interatomic potential (MLIP) models for materials simulations, there is an increasing demand for robust, automated, and chemically insightful benchmarking methodologies. In response, we here introduce LiPS-25, a curated benchmark dataset for a canonical series of solid-state electrolyte materials from the Li2S-P2S5 pseudo-binary compositional line, including crystalline and amorphous configurations. Together with the dataset, we present a suite of performance tests that range from conventional numerical error metrics to physically motivated evaluation tasks. With a focus on graph-based MLIP architectures, we run numerical experiments that assess (i) the effect of hyperparameters and (ii) the fine-tuning behavior of selected pre-trained ("foundational") MLIP models. Beyond the Li-P-S solid-state electrolytes, we expect that such benchmarks and their code implementations can be readily adapted to other material systems.

Paper Structure

This paper contains 13 sections, 1 equation, 6 figures, 3 tables.

Figures (6)

  • Figure 1: The LiPS-25 dataset. (a) Ternary diagram of the Li–P–S system, with the tie-line between Li2S and P2S5 indicated. Green circles mark compositions with known crystalline phases; compositions in bold were used to build the LiPS-25 dataset. Key structural motifs are displayed below: ortho-thiophosphate, [PS4]$^{3-}$; pyro-thiophosphate, [P2S7]$^{4-}$; and hypo-thiophosphate, [P2S6]$^{4-}$. (b) Scatter plot for the LiPS-25 dataset showing the fraction of Li in each structure ($x$-axis) versus energy above the convex hull ($y$-axis); a dashed line at $y=0$ has been added. Dimer configurations are excluded from this plot. Representative structures are shown (atomic color coding: Li, green; P, purple; S, yellow), visualized with VESTA [momma_vesta_2011]; colored outlines act as a legend for the scatter plot, with purple for crystalline structures, teal for AIMD snapshots, orange/red for iterative melt–quench configurations (Iter1-$x$, Iter2-$x$), and turquoise for random hard spheres.
  • Figure 2: Benchmark tasks. (a) Overview of validation techniques for MLIPs, which fall into two groups: "static" validation, assessing numerical errors and basic energetic profiles, and "dynamic" validation based on MD simulations, leveraging domain expertise to evaluate MLIP performance. From these, four domain-specific benchmarking tasks have been selected to accompany the LiPS-25 dataset, facilitating physically motivated evaluation of MLIPs. (b) Task 1: Energetic accuracy. The MLIP is used to predict formation energies of 8 crystalline structures along the Li2S–P2S5 tie-line. These predictions are compared against ground-truth values to compute the RMSE($E_{f}$) metric. (c) Task 2: Domain-specific force accuracy. Evenly spaced snapshots are taken from an NPT melt–quench simulation of Li7P3S11. Force errors are calculated with respect to DFT, and aggregated into the RMSE(F) metric. (d) Task 3: Property accuracy. The room-temperature ionic conductivity of the Li7P3S11 crystal, a known superionic conductor, is predicted; the inset illustrates Li ion migration, visualized with VESTA [momma_vesta_2011]. (e) Task 4: Robustness. NPT simulations across a grid of temperatures and pressures are run and assessed by simulation survival and by the number of close-contact events.
  • Figure 3: Benchmark task performance for a MACE hyperparameter sweep. (a) Performance for formation energies (Task 1) and domain-specific forces (Task 2), reported as the mean of five training repeats. Each value reflects the model's performance under a different hyperparameter configuration, with the metric corresponding to a task-specific prediction error. (b) Performance for the predicted magnitude of ionic conductivity, $\sigma_{\text{298}}$, averaged over three repeats (Task 3). As in (a), results reflect variations in model architecture arising from the hyperparameter sweep. In both panels, the boxes outlined in bold indicate the model using a 6 Å cutoff selected from the initial sweep of cutoff radii.
  • Figure 4: Robustness evaluation of representative MLIP models (Task 4). (a) Example trajectory of an NPT annealing simulation, showing the initial relaxed random-hard-sphere structure and two subsequent configurations illustrating expansion and vaporization under high-temperature and high-pressure conditions (atomic color coding: Li, green; P, purple; S, yellow). All structures are drawn to scale and were visualized with OVITO [stukowski_visualization_2009]. (b) Grid-search results for three exemplar models: a from-scratch MACE model trained with a 6 Å radial cutoff (from the hyperparameter sweep in Fig. 3), and the foundation models MACE-MP-0b3 and MACE-OMAT-0. For each model, 100 ps NPT anneals were performed across a grid of temperatures (1000–16,000 K) and pressures ($10^{6}$–$10^{12}$ Pa) with three repeats. Boxes are shaded green if all repeats reached 100 ps, pale green if one or two repeats reached 100 ps, and white if all repeats failed. The percentages inside the boxes denote the fraction of frames (sampled every 1 ps) with interatomic separations of $\leq 1$ Å. A dash ("--") indicates that all three repeats failed before 1 ps, i.e., no frames were available for evaluation.
  • Figure 5: Accuracy of foundation models fine-tuned on 25 Li7P3S11 structures from LiPS-25. We assess these models on a Li7P3S11 melt–quench trajectory (Task 2), showing energy (top) and force (bottom) errors against DFT-labeled snapshots. Errors for fine-tuned models (solid lines) are averaged over 5 models fine-tuned with different seeds. Zero-shot errors (dashed lines), corresponding to models evaluated without fine-tuning, are shown for comparison. For these zero-shot models, energy predictions were corrected using the add_auto_offset feature of graph-pes to account for differences in reference atomic energies between the pretraining datasets and LiPS-25, arising from the use of different exchange–correlation functionals and pseudopotentials. (a) Schematic for an atomistic ML model, showing the mapping between the dataset and the model architecture. (b) Performance of MACE foundation models: assessing the effect of differences in training dataset for similar architecture. (c) Performance of other foundation model families, viz. MatterSim and Orb: assessing differences in model architecture.
  • ...and 1 more figures
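
Two of the metrics named in the captions above, RMSE(F) from Task 2 and the close-contact fraction reported in Fig. 4(b), can be illustrated with a short NumPy sketch. This is not the authors' implementation. The function names are hypothetical, the RMSE is assumed to be taken over individual Cartesian force components (the captions do not specify the averaging convention), and the distance check assumes an orthorhombic periodic cell whose edges are all longer than twice the threshold.

```python
import numpy as np


def force_rmse(f_pred, f_ref):
    """RMSE over all Cartesian force components (assumed convention).

    Both inputs have matching shapes, e.g. (n_frames, n_atoms, 3).
    """
    diff = np.asarray(f_pred, dtype=float) - np.asarray(f_ref, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))


def min_distance(positions, cell_lengths):
    """Shortest interatomic distance under the minimum-image convention.

    Assumes an orthorhombic box given by its three edge lengths.
    """
    pos = np.asarray(positions, dtype=float)
    box = np.asarray(cell_lengths, dtype=float)
    delta = pos[:, None, :] - pos[None, :, :]
    delta -= box * np.round(delta / box)  # wrap to the nearest periodic image
    dist = np.linalg.norm(delta, axis=-1)
    np.fill_diagonal(dist, np.inf)  # ignore self-distances
    return float(dist.min())


def close_contact_fraction(frames, threshold=1.0):
    """Fraction of frames containing at least one pair at <= threshold (in Å).

    `frames` is a sequence of (positions, cell_lengths) pairs, so the cell
    may change between frames as it does in an NPT run. Returns None when
    no frames are available, mirroring the "--" entries in Fig. 4(b).
    """
    flags = [min_distance(p, c) <= threshold for p, c in frames]
    if not flags:
        return None
    return float(np.mean(flags))
```

In practice the positions and cell for each frame would be read from the trajectory at the 1 ps sampling interval mentioned in the caption; for non-orthorhombic cells, a general minimum-image routine from an atomistic simulation library would be needed instead.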