AlphaDiffract: Automated Crystallographic Analysis of Powder X-ray Diffraction Data

Nina Andrejevic, Ming Du, Hemant Sharma, James P. Horwath, Aileen Luo, Xiangyu Yin, Michael Prince, Brian H. Toby, Mathew J. Cherukara

Abstract

Materials identification and structural understanding from powder X-ray diffraction (PXRD) data is a long-standing challenge in materials science, fundamental to discovering and characterizing novel materials. A prerequisite for full structure solution is the accurate determination of the crystal lattice, including lattice parameters and crystallographic symmetries. Traditional methods for this are iterative and typically require expert input, and while existing deep learning approaches have shown promise, a robust, single-shot method for comprehensive lattice determination from experimental data remains a key goal. Here, we introduce AlphaDiffract, a deep learning framework that achieves state-of-the-art performance in predicting the crystal system, space group, and lattice parameters directly from PXRD patterns. AlphaDiffract utilizes a 1D adaptation of the ConvNeXt architecture, a modern convolutional neural network that integrates key design principles from transformers, coupled with dedicated prediction heads for each crystallographic property. The model is trained on the largest-to-date physics-based dataset of over 31 million simulated diffraction patterns, generated by augmenting 312,267 curated structures from the ICSD and Materials Project databases. Crucially, it demonstrates strong generalization to experimental data, achieving 81.7% crystal system accuracy and 66.2% space group accuracy on the RRUFF dataset while additionally predicting all six lattice parameters. By providing a unified model for rapid and accurate lattice determination from PXRD data, AlphaDiffract represents a significant step forward in leveraging deep learning for high-throughput materials discovery.

Paper Structure (26 sections, 13 equations, 11 figures, 8 tables)

Figures (11)

  • Figure 1: AlphaDiffract model architecture. The AlphaDiffract model consists of a 1D ConvNeXt backbone that processes input PXRD patterns through a series of ConvNeXt blocks with progressive downsampling. The composition of each ConvNeXt block is indicated in the bottom left inset. The extracted features are fed into three separate prediction heads: a crystal system (CS) classifier, a space group (SG) classifier, and a lattice parameter (LP) regressor. Each head employs a multi-layer perceptron architecture with layer dimensions as indicated.
  • Figure 1: Distribution of crystal systems and space groups across crystallographic databases and structural uniqueness analysis. a. ICSD database showing the distribution of crystal systems (inner ring) and their associated space groups (outer ring) with color intensity indicating the number and percentage of structures. b. Materials Project database displaying the same hierarchical representation of crystal systems and space groups. c. Pie chart quantifying structural uniqueness and redundancy across ICSD and Materials Project databases, where "unique" structures are crystallographically distinct and "equivalent" structures are identified as similar to a unique structure based on the structure similarity metric detailed in Section \ref{subsec:similarity}. d. RRUFF database showing crystal system and space group distributions following the same visualization scheme as panels a and b. The seven crystal systems are: 1-triclinic, 2-monoclinic, 3-orthorhombic, 4-tetragonal, 5-trigonal, 6-hexagonal, and 7-cubic. Color bars indicate both absolute counts and relative percentages for crystal systems and space groups in each database.
  • Figure 2: Evaluation of space group predictions using Graph Earth Mover's Distance. a. Illustration of true ($y_\text{SG}$) and predicted ($\hat{y}_\text{SG}$) space group probability distributions on a representative subgroup graph, where nodes represent space groups and edges indicate maximal subgroup relationships. The true label assigns probability 1 to a single node (yellow), while the predicted distribution typically spreads probability across multiple nodes. b. Distance matrix computed from maximal subgroup relationships between space groups, where color intensity indicates the minimum number of graph edges connecting each pair of space groups. The vertical vector ($y_\text{SG}$) represents a one-hot encoded true label that selects the corresponding row for calculating the GEMD loss against predicted distributions. c-e. Distribution of prediction errors as a function of graph distance (number of edges) from the true space group for three datasets: c. ICSD validation set, d. Materials Project validation set, and e. RRUFF test set. Filled bars show the percentage of all predictions (including correct predictions at distance 0) that fall at each graph distance from the true space group, for three different GEMD loss weights ($\mu$ = 0, 1, 2). Distance zero indicates a correct prediction (predicted space group = true space group). Unfilled bars with labeled values indicate the corresponding cumulative percentages up to and including that distance. With higher $\mu$ values, predictions become increasingly concentrated at shorter graph distances from the true space group.
  • Figure 2: Distribution of lattice lengths across crystallographic databases. a. Histogram of Niggli reduced cell lattice lengths (a, b, c) for structures in the final ICSD dataset. b. Corresponding lattice length distributions for structures in the final Materials Project dataset. c. Lattice length distributions for structures in the RRUFF dataset used for regression.
  • Figure 3: AlphaDiffract ensemble model performance. a. Crystal system (CS) and space group (SG) prediction accuracies on the RRUFF dataset as a function of ensemble size. Error bars represent the uncertainty in model predictions within the ensemble. b. Distribution of prediction errors as a function of graph distance (number of edges) from the true space group for the 10-model ensemble evaluated on the three datasets. Unfilled bars with labeled values indicate the cumulative percentage of predictions falling within that graph distance of the true space group (i.e., the sum of all filled bars up to and including that distance). c-e. Parity plots comparing predicted versus true lattice parameters for the 10-model ensemble across three datasets: c. ICSD, d. Materials Project, and e. RRUFF. Each panel shows predictions for the three lattice lengths ($a$, $b$, $c$; top row) and three lattice angles ($\alpha$, $\beta$, $\gamma$; bottom row). Dashed lines indicate perfect agreement. Heat map coloring represents point density. $R^{2}$ values indicate goodness of fit.
  • ...and 6 more figures
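
The Graph Earth Mover's Distance caption above describes a distance matrix built from minimum edge counts on the subgroup graph, with the one-hot true label selecting one row of that matrix. The sketch below is one plausible reading of that description, not the authors' implementation. It assumes the graph is undirected and connected, and it uses a hypothetical four-node toy graph in place of the real space group graph. The function names are illustrative. For a one-hot true label, the earth mover's distance to a predicted distribution reduces to the expected graph distance from the true node, which is what `gemd_one_hot` computes.

```python
from collections import deque


def graph_distance_matrix(n_nodes, edges):
    """All-pairs shortest-path lengths (in edges) on an undirected graph, via BFS."""
    adj = [[] for _ in range(n_nodes)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    inf = float("inf")
    dist = [[inf] * n_nodes for _ in range(n_nodes)]
    for src in range(n_nodes):
        dist[src][src] = 0
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if dist[src][v] == inf:  # not visited yet
                    dist[src][v] = dist[src][u] + 1
                    queue.append(v)
    return dist


def gemd_one_hot(pred_probs, true_index, dist):
    """Expected graph distance from the true node under the predicted distribution.

    The one-hot label selects row `true_index` of the distance matrix.
    """
    return sum(p * d for p, d in zip(pred_probs, dist[true_index]))


def cumulative_percent_by_distance(true_labels, pred_labels, dist):
    """Cumulative percentage of predictions within each graph distance of the truth.

    Assumes a connected graph, so that every distance is a finite integer.
    """
    distances = [dist[t][p] for t, p in zip(true_labels, pred_labels)]
    total = len(distances)
    running = 0
    out = {}
    for d in range(max(distances) + 1):
        running += distances.count(d)
        out[d] = 100.0 * running / total
    return out


# Hypothetical toy graph: node 1 is linked to nodes 0, 2, and 3.
toy_edges = [(0, 1), (1, 2), (1, 3)]
D = graph_distance_matrix(4, toy_edges)

# Predicted distribution over the four nodes; the true label is node 0.
loss = gemd_one_hot([0.5, 0.25, 0.125, 0.125], true_index=0, dist=D)

# Top-1 predictions for four samples whose true label is node 0.
cumulative = cumulative_percent_by_distance([0, 0, 0, 0], [0, 1, 2, 0], D)
```

Here `loss` penalizes probability mass in proportion to its distance from the true node, so a prediction one edge away costs less than one two edges away. `cumulative` corresponds to the unfilled bars described in the captions: the share of predictions falling within each graph distance of the true label.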