All that structure matches does not glitter

Maya M. Martirossyan; Thomas Egg; Philipp Hoellmer; George Karypis; Mark Transtrum; Adrian Roitberg; Mingjie Liu; Richard G. Hennig; Ellad B. Tadmor; Stefano Martiniani

TL;DR

The paper shows that existing crystal structure prediction (CSP) benchmarks for inorganic crystals are undermined by duplicate structures and polymorphism, both of which distort performance metrics. It introduces curated datasets and polymorph-aware splits (e.g., carbon-24-unique, carbon-X, carbon-NXL, perov-5-polymorph-split, MP-20-polymorph-split) and two new evaluation metrics, METRe and cRMSE, to properly assess structural diversity and predictive accuracy. Through experiments with DiffCSP, FlowMM, and OMatG, it demonstrates that accounting for polymorphism alters model rankings and that METRe and cRMSE provide more informative benchmarking. The work calls for rigorous dataset design and standardized metrics to advance reliable crystal structure prediction, and it shares datasets and code to facilitate adoption by the community.

Abstract

Generative models for materials, especially inorganic crystals, hold potential to transform the theoretical prediction of novel compounds and structures. Advancement in this field depends on robust benchmarks and minimal, information-rich datasets that enable meaningful model evaluation. This paper critically examines common datasets and reported metrics for a crystal structure prediction task: generating the most likely structures given the chemical composition of a material. We focus on three key issues. First, materials datasets should contain unique crystal structures; for example, we show that the widely used carbon-24 dataset contains only $\approx$40% unique structures. Second, materials datasets should not be split randomly if polymorphs of many different compositions are numerous, which we find to be the case for the perov-5 and MP-20 datasets. Third, benchmarks can mislead if used uncritically, e.g., reporting a match rate metric without considering the structural variety exhibited by identical building blocks. To address these oft-overlooked issues, we introduce several fixes. We provide revised versions of the carbon-24 dataset: one with duplicates removed, one deduplicated and split by number of atoms $N$, one with enantiomorphs, and two containing only identical structures but with different unit cells. We also propose new splits for datasets with polymorphs, ensuring that polymorphs are grouped within each split subset, setting a more sensible standard for benchmarking model performance. Finally, we present METRe and cRMSE, new model evaluation metrics that can correct existing issues with the match rate metric.
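
To make the idea of a polymorph-aware split concrete, the following is a minimal sketch and not the authors' released code. It assumes that polymorphs can be identified by a shared reduced composition, and that each dataset entry is available as a dictionary of element counts. The helper names `reduced_formula` and `polymorph_split`, as well as the 60/20/20 target fractions, are hypothetical choices made for illustration only.

```python
import random
from collections import defaultdict
from functools import reduce
from math import gcd


def reduced_formula(counts):
    """Return a canonical key for a composition given as {element: count}.

    Assumption: entries sharing this key are treated as polymorphs.
    """
    divisor = reduce(gcd, counts.values())
    return "".join(f"{el}{n // divisor}" for el, n in sorted(counts.items()))


def polymorph_split(entries, fractions=(0.6, 0.2, 0.2), seed=0):
    """Assign whole composition groups to train/val/test.

    `entries` is a list of {element: count} dicts; the result maps each
    split name to a list of entry indices. Because groups are never divided,
    the realized split sizes only approximate the target fractions.
    """
    groups = defaultdict(list)
    for idx, counts in enumerate(entries):
        groups[reduced_formula(counts)].append(idx)

    keys = sorted(groups)                 # deterministic order before shuffling
    random.Random(seed).shuffle(keys)

    total = len(entries)
    train_cut = fractions[0] * total
    val_cut = (fractions[0] + fractions[1]) * total

    splits = {"train": [], "val": [], "test": []}
    seen = 0
    for key in keys:
        if seen < train_cut:
            target = "train"
        elif seen < val_cut:
            target = "val"
        else:
            target = "test"
        splits[target].extend(groups[key])  # the whole group goes to one split
        seen += len(groups[key])
    return splits
```

In contrast to a random split over individual structures, this grouping guarantees that no composition seen during training reappears in the validation or test subsets, which is the property that the polymorph-aware splits are meant to enforce.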
