opXRD: Open Experimental Powder X-ray Diffraction Database

Daniel Hollarek, Henrik Schopmans, Jona Östreicher, Jonas Teufel, Bin Cao, Adie Alwen, Simon Schweidler, Mriganka Singh, Tim Kodalle, Hanlin Hu, Gregoire Heymans, Maged Abdelsamie, Arthur Hardiagon, Alexander Wieczorek, Siarhei Zhuk, Ruth Schwaiger, Sebastian Siol, François-Xavier Coudert, Moritz Wolf, Carolin M. Sutter-Fella, Ben Breitung, Andrea M. Hodge, Tong-yi Zhang, Pascal Friederich

TL;DR

The paper addresses the lack of large, open experimental pXRD datasets, which hinders automated analysis and the transfer of models from simulated to real data. It introduces opXRD, a growing open database of experimental pXRD diffractograms comprising 92,552 patterns (2,179 labeled and 90,373 unlabeled) from six institutions and multiple measurement geometries, together with a workflow to ingest contributions and publish them on Zenodo under CC BY 4.0. The resource includes a Python toolset (opxrd) for loading, standardization, visualization, and conversion to ML-ready tensors, plus a Colab notebook for demonstration. By providing both labeled and unlabeled data, opXRD supports benchmarking, transfer learning, and realistic evaluation of pXRD analysis approaches, aiming to bridge the gap between simulated and experimental data and to accelerate automated, high-throughput materials discovery.
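The standardization step mentioned above can be pictured as resampling each pattern onto a shared $2\theta$ grid and rescaling its intensities. The following is a minimal sketch in plain Python. It does not use the opxrd package, and the function name `standardize_pattern`, the grid bounds, the number of grid points, and the zero padding outside the measured range are all illustrative assumptions, not details taken from the paper.

```python
from bisect import bisect_right


def standardize_pattern(two_theta, intensity, start=10.0, stop=80.0, num=512):
    """Resample a pattern onto a uniform 2-theta grid and scale its maximum to 1.

    Hypothetical helper for illustration; not part of the opxrd API.
    """
    if len(two_theta) != len(intensity) or len(two_theta) < 2:
        raise ValueError("need two equally long sequences with at least 2 points")
    if num < 2:
        raise ValueError("num must be at least 2")

    # Sort by angle so that interpolation sees an increasing x axis.
    pairs = sorted(zip(two_theta, intensity))
    xs = [p[0] for p in pairs]
    ys = [p[1] for p in pairs]

    step = (stop - start) / (num - 1)
    grid = [start + i * step for i in range(num)]

    resampled = []
    for x in grid:
        if x < xs[0] or x > xs[-1]:
            # Assumption: angles outside the measured range are padded with zero.
            resampled.append(0.0)
            continue
        j = bisect_right(xs, x)  # first index whose angle is strictly above x
        if j == len(xs):
            resampled.append(float(ys[-1]))  # x coincides with the last angle
            continue
        x0, x1 = xs[j - 1], xs[j]
        y0, y1 = ys[j - 1], ys[j]
        t = (x - x0) / (x1 - x0)
        resampled.append(y0 + t * (y1 - y0))  # linear interpolation

    peak = max(resampled)
    if peak > 0:
        resampled = [v / peak for v in resampled]
    return grid, resampled


# Toy pattern with a single triangular peak at 30 degrees.
grid, values = standardize_pattern(
    [10, 20, 30, 40], [0, 10, 20, 10], start=10.0, stop=40.0, num=7
)
```

A fixed-length output like `values` is what makes patterns from instruments with different angular ranges and step sizes stackable into a single tensor. Linear interpolation is only one possible choice here.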

Abstract

Powder X-ray diffraction (pXRD) experiments are a cornerstone for materials structure characterization. Despite their widespread application, analyzing pXRD diffractograms still presents a significant challenge to automation and a bottleneck in high-throughput discovery in self-driving labs. Machine learning promises to resolve this bottleneck by enabling automated powder diffraction analysis. A notable difficulty in applying machine learning to this domain is the lack of sufficiently sized experimental datasets, which has constrained researchers to train primarily on simulated data. However, models trained on simulated pXRD patterns showed limited generalization to experimental patterns, particularly for low-quality experimental patterns with high noise levels and elevated backgrounds. With the Open Experimental Powder X-Ray Diffraction Database (opXRD), we provide an openly available and easily accessible dataset of labeled and unlabeled experimental powder diffractograms. Labeled opXRD data can be used to evaluate the performance of models on experimental data and unlabeled opXRD data can help improve the performance of models on experimental data, e.g. through transfer learning methods. We collected 92552 diffractograms, 2179 of them labeled, from a wide spectrum of materials classes. We hope this ongoing effort can guide machine learning research toward fully automated analysis of pXRD data and thus enable future self-driving materials labs.

Paper Structure

This paper contains 5 sections, 1 equation, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Experimental powder X-ray diffraction (pXRD) patterns from several contributors are collected in the opXRD database. The proposed open-access database of experimental data aims to support each step in the pXRD-related machine learning workflow by informing better physics simulations, supplying model training data, and providing a foundation for realistic performance evaluations.
  • Figure 2: Overview of the data collection pipeline. Datasets are submitted using an online submission form, optionally with the help of our submission helper software. After post-processing and data homogenization, we offer the creation of a Zenodo entry for each user submission and subsequently include the submission in the opXRD database.
  • Figure 3: Explained variance ratio over the fraction of the maximum number of components for each dataset contributed to the opXRD database. Here, the maximum number of components refers to $N_{\text{max}}$ as defined in Equation (1). Datasets contributed by the same institution are labeled alphabetically in the order in which they are described in the text towards the end of this section.
  • Figure 4: Histograms detailing the distribution of pattern and structure properties in the opXRD database: a) distribution of spacegroups present in labeled data; b) distribution of angular resolution in all data; c) distribution of smallest and largest recorded $2\theta$ values for all data.
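
The per-pattern quantities histogrammed in Figure 4b and 4c can be computed directly from each pattern's $2\theta$ axis. The sketch below assumes that each pattern is available as a list of $2\theta$ values in degrees and that "angular resolution" means the average step between neighbouring points. That definition and the helper names `pattern_stats` and `histogram` are assumptions made for illustration, not taken from the paper or from the opxrd package.

```python
import math
from collections import Counter


def pattern_stats(two_theta):
    """Return smallest angle, largest angle, and mean angular step of one pattern."""
    if len(two_theta) < 2:
        raise ValueError("a pattern needs at least 2 points")
    lo, hi = min(two_theta), max(two_theta)
    # Assumption: resolution is the average spacing between recorded angles.
    resolution = (hi - lo) / (len(two_theta) - 1)
    return {"min": lo, "max": hi, "resolution": resolution}


def histogram(values, bin_width):
    """Count values per bin; keys are the left edges of the occupied bins."""
    counts = Counter(math.floor(v / bin_width) for v in values)
    return {k * bin_width: counts[k] for k in sorted(counts)}


# Three toy patterns with different angular ranges and step sizes.
patterns = [
    [10.0, 10.5, 11.0, 11.5, 12.0],
    [12.0, 14.0, 16.0, 18.0],
    [5.0, 6.0, 7.0],
]
stats = [pattern_stats(p) for p in patterns]
min_angle_hist = histogram([s["min"] for s in stats], bin_width=5.0)
```

Plotting `min_angle_hist` and the analogous histograms for `max` and `resolution` gives the kind of overview shown in panels b and c. Panel a would additionally need the space group labels, which only the labeled subset provides.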