All that structure matches does not glitter

Maya M. Martirossyan, Thomas Egg, Philipp Hoellmer, George Karypis, Mark Transtrum, Adrian Roitberg, Mingjie Liu, Richard G. Hennig, Ellad B. Tadmor, Stefano Martiniani

TL;DR

The paper shows that existing CSP benchmarks for inorganic crystals are undermined by duplicate structures and polymorphism, which distort performance metrics. It introduces curated datasets and polymorph-aware splits (e.g., carbon-24-unique, carbon-X, carbon-NXL, perov-5-polymorph-split, MP-20-polymorph-split) and new evaluation metrics METRe and cRMSE to properly assess structural diversity and predictive accuracy. Through experiments with DiffCSP, FlowMM, and OMatG, it demonstrates that accounting for polymorphism alters model rankings and that METRe and cRMSE provide more informative benchmarking. The work calls for rigorous dataset design and standardized metrics to advance reliable crystal-structure prediction, and shares datasets and code to facilitate adoption by the community.

Abstract

Generative models for materials, especially inorganic crystals, hold potential to transform the theoretical prediction of novel compounds and structures. Advancement in this field depends on robust benchmarks and minimal, information-rich datasets that enable meaningful model evaluation. This paper critically examines common datasets and reported metrics for a crystal structure prediction task: generating the most likely structures given the chemical composition of a material. We focus on three key issues: First, materials datasets should contain unique crystal structures; for example, we show that the widely-utilized carbon-24 dataset only contains $\approx$40% unique structures. Second, materials datasets should not be split randomly if polymorphs of many different compositions are numerous, which we find to be the case for the perov-5 and MP-20 datasets. Third, benchmarks can mislead if used uncritically, e.g., reporting a match rate metric without considering the structural variety exhibited by identical building blocks. To address these oft-overlooked issues, we introduce several fixes. We provide revised versions of the carbon-24 dataset: one with duplicates removed, one deduplicated and split by number of atoms $N$, one with enantiomorphs, and two containing only identical structures but with different unit cells. We also propose new splits for datasets with polymorphs, ensuring that polymorphs are grouped within each split subset, setting a more sensible standard for benchmarking model performance. Finally, we present METRe and cRMSE, new model evaluation metrics that can correct existing issues with the match rate metric.
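The idea of grouping polymorphs within each split subset can be illustrated with a minimal sketch. It is not the authors' released code: the function name `polymorph_grouped_split`, the use of a composition string as the grouping key, the split fractions, and the structure identifiers are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def polymorph_grouped_split(entries, fractions=(0.6, 0.2, 0.2), seed=0):
    """Split (composition, structure_id) pairs so that every structure
    sharing a composition lands in the same subset.

    Hypothetical helper: grouping by composition string is an assumption.
    """
    # Collect all polymorphs of each composition into one group.
    groups = defaultdict(list)
    for composition, structure_id in entries:
        groups[composition].append(structure_id)

    # Shuffle the groups (not the individual structures) reproducibly.
    compositions = sorted(groups)
    random.Random(seed).shuffle(compositions)

    names = ("train", "val", "test")
    splits = {name: [] for name in names}
    total = len(entries)
    for composition in compositions:
        # Send the whole group to the subset furthest below its target size.
        deficits = [
            fractions[i] * total - len(splits[names[i]]) for i in range(3)
        ]
        target = names[deficits.index(max(deficits))]
        splits[target].extend(
            (composition, sid) for sid in groups[composition]
        )
    return splits


# Toy data: compositions with several polymorphs (identifiers are made up).
entries = [
    ("CaCdSO2", "a"), ("CaCdSO2", "b"),
    ("HfNbN3", "a"), ("HfNbN3", "b"),
    ("C", "c1"), ("C", "c2"), ("C", "c3"),
    ("SrTiO3", "s1"),
    ("NaCl", "n1"),
    ("MgO", "m1"),
]
splits = polymorph_grouped_split(entries)
```

Because whole groups are assigned at once, the subset sizes only approximate the requested fractions. That trade-off is what keeps a model from seeing one polymorph of a composition in training and being tested on another.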

Paper Structure

This paper contains 36 sections, 5 equations, 5 figures, 12 tables.

Figures (5)

  • Figure 1: Enumerating existing features of datasets and benchmarks used in crystal structure prediction for generative models of inorganic crystals. (a) Two perov-5 structures of composition CaCdSO$_2$, but with different structural prototypes in which structure $b$ is a distorted version of structure $a$. (b) Two perov-5 structures of composition HfNbN$_3$, with the same structural prototype but with the elements at the A and B sites (Hf and Nb) swapped in the perovskite ABX$_3$ structural prototype. (c) Two carbon-24 duplicate structures (one in dark and the other in light gray) with their unit cells marked in red. (d) Three carbon-24 duplicate structures with different unit cell sizes. (e) Views along a lattice direction of (top) a perov-5 test set structure and (bottom) a structure from a generative model which are considered "matching" despite significant structural distortions between the two, calculated using Pymatgen's StructureMatcher module with standard tolerances ltol$=0.3$, stol$=0.5$, angle_tol$=10.0$.
  • Figure 2: Kernel density estimates (with tophat kernel for large plots and Gaussian kernel for insets) of the distributions of match-boundary tolerance and uniqueness fraction for (a) stol, (b) ltol, and (c) angle_tol performed on the carbon-24 dataset. These densities only count structure pairs which are considered matching at or below the maximum tolerances, and ignore structure pairs which are too structurally distinct to match.
  • Figure 3: Demonstrating prior and new benchmarks. (a–d) A toy case, in which the same colored shapes are considered polymorphs, shows different ways of computing match rate: (a) standard match rate, which penalizes polymorphs in the generated set being out of order; (b) "match everyone" metric, which fixes the fictitious penalty in (a); (c) a case of the "match everyone" metric in which a high match rate can be achieved without generating the diversity of polymorph structures; (d) our solution to the problems posed in (a) and (c), in which the number of matches from the "match everyone" metric is counted with respect to the reference set. (e) A demonstration of how "match everyone" differs when computed with respect to the generated vs. reference structures, showing that only the metric with respect to the reference structures (METRe) catches cases in which none of the generated structures match a given reference structure. (f) The implementation of corrected RMSE on a given matching metric.
  • Figure 4: Tolerance sensitivity plots for OMatG models trained on the polymorph-split MP-20 dataset. (a–b) Best-performing model with linear positional interpolant and ODE sampling and (c–d) worst-performing model with trigonometric interpolant with the latent variable $\gamma$ and ODE sampling. METRe rates are shown for (a) and (c) and cRMSE values are shown for (b) and (d); color bars have equivalently-sized ranges across subfigures. Vertical lines are drawn for clarity.
  • Figure 5: Tophat kernel density estimate of the distributions of match-boundary tolerance and uniqueness fraction for (a) stol, (b) ltol, and (c) angle_tol performed on the carbon-24-unique dataset. These densities only count structure pairs which are considered matching at or below the maximum tolerances, and ignore structure pairs which are too structurally distinct to match.
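The toy case in Figure 3 (a-e) can be mimicked with a short sketch. It reflects one reading of the caption, not the paper's implementation: structures are reduced to `(composition, polymorph_label)` tuples, the hypothetical `matches` function stands in for a real structure matcher such as Pymatgen's StructureMatcher, and all function names are made up for illustration.

```python
def matches(a, b):
    # Stand-in matcher (assumption): two toy structures match when both
    # composition and polymorph label are equal.
    return a == b


def standard_match_rate(generated, reference):
    # Index-wise comparison: generated[i] is only checked against reference[i].
    hits = sum(matches(g, r) for g, r in zip(generated, reference))
    return hits / len(reference)


def match_everyone_generated(generated, reference):
    # Fraction of generated structures matching ANY reference structure.
    hits = sum(any(matches(g, r) for r in reference) for g in generated)
    return hits / len(generated)


def match_everyone_reference(generated, reference):
    # Fraction of reference structures matched by ANY generated structure
    # (the reference-side counting described for METRe in the caption).
    hits = sum(any(matches(g, r) for g in generated) for r in reference)
    return hits / len(reference)


reference = [("A", "p1"), ("A", "p2"), ("B", "q1")]

# Both polymorphs of A are generated, but in swapped order.
swapped = [("A", "p2"), ("A", "p1"), ("B", "q1")]

# Only one polymorph of A is generated, twice.
collapsed = [("A", "p1"), ("A", "p1"), ("B", "q1")]

swapped_scores = (
    standard_match_rate(swapped, reference),
    match_everyone_generated(swapped, reference),
    match_everyone_reference(swapped, reference),
)
collapsed_scores = (
    standard_match_rate(collapsed, reference),
    match_everyone_generated(collapsed, reference),
    match_everyone_reference(collapsed, reference),
)
```

In the swapped case the index-wise rate drops to 1/3 even though every polymorph was generated, while both "match everyone" variants give 1. In the collapsed case the generated-side variant still gives 1, and only the reference-side count falls to 2/3, exposing the missing polymorph.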