Machine Learning-Driven Crystal System Prediction for Perovskites Using Augmented X-ray Diffraction Data

Ansu Mathew, Ahmer A. B. Baloch, Alamin Yakasai, Hemant Mittal, Vivian Alberts, Jayakumar V. Karunamurthy

TL;DR

This work addresses rapid and automated symmetry classification of perovskite XRD patterns by treating XRD spectra as sequential data and applying a Time Series Forest with augmentation (e.g., SMOTE and jittering). Using a curated dataset from the Materials Project with crystallography-derived labels, the authors demonstrate high accuracy across crystal system, point group, and space group predictions, particularly for high-symmetry classes, while also analyzing compositional trends that influence symmetry distributions. The approach achieves strong performance under imbalanced conditions with data-efficient models, offering a scalable pipeline for high-throughput materials discovery and autonomous experimentation. The study provides a reproducible framework, including data provenance and a clear pathway for extending to multi-modal inputs and experimental XRD data.
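As a rough illustration of the SMOTE-style oversampling mentioned above, the sketch below creates synthetic minority-class spectra by interpolating between a sample and one of its nearest same-class neighbours. The function name `smote_oversample`, the parameters `k` and `seed`, and the toy four-point "spectra" are all hypothetical; this is not the authors' implementation, which may well rely on an existing library.

```python
import random

def smote_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic samples by interpolating between a minority
    sample and one of its k nearest minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by squared Euclidean distance.
        others = [m for j, m in enumerate(minority) if j != i]
        others.sort(key=lambda m: sum((a - b) ** 2 for a, b in zip(base, m)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic

# Toy data (assumed): three 4-point intensity vectors from a rare class.
rare_class = [
    [0.0, 1.0, 0.2, 0.0],
    [0.0, 0.8, 0.4, 0.0],
    [0.1, 0.9, 0.3, 0.0],
]
new_samples = smote_oversample(rare_class, n_new=5, k=2)
```

Because each synthetic point lies on a line segment between two real samples, every intensity value stays within the range already observed for that class.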

Abstract

Prediction of crystal system from X-ray diffraction (XRD) spectra is a critical task in materials science, particularly for perovskite materials, which are known for their diverse applications in photovoltaics, optoelectronics, and catalysis. In this study, we present a machine learning (ML)-driven framework that leverages advanced models, including Time Series Forest (TSF), Random Forest (RF), Extreme Gradient Boosting (XGBoost), Recurrent Neural Network (RNN), Long Short-Term Memory (LSTM), Gated Recurrent Unit (GRU), and a simple feedforward neural network (NN), to classify crystal systems, point groups, and space groups from XRD data of perovskite materials. To address class imbalance and enhance model robustness, we integrated feature augmentation strategies such as Synthetic Minority Over-sampling Technique (SMOTE), class weighting, jittering, and spectrum shifting, along with efficient data preprocessing pipelines. The TSF model with SMOTE augmentation achieved strong performance for crystal system prediction, with a Matthews correlation coefficient (MCC) of 0.9, an F1 score of 0.92, and an accuracy of 97.76%. For point and space group prediction, balanced accuracies above 95% were obtained. The model demonstrated high performance for symmetry-distinct classes, including cubic crystal systems, point groups 3m and m-3m, and space groups Pnma and Pnnn. This work highlights the potential of ML for XRD-based structural characterization and accelerated discovery of perovskite materials.
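The metrics quoted in the abstract (MCC, F1, accuracy, balanced accuracy) can all be computed from paired true and predicted labels. The sketch below is one way to do so under stated assumptions: it uses the standard multiclass generalisation of MCC and macro-averaged F1, which may differ from the averaging the authors used. The function name `classification_metrics` and the toy labels are hypothetical.

```python
import math
from collections import Counter

def classification_metrics(y_true, y_pred):
    """Return accuracy, balanced accuracy, macro F1 and multiclass MCC."""
    labels = sorted(set(y_true) | set(y_pred))
    n = len(y_true)
    true_counts = Counter(y_true)   # samples per true class
    pred_counts = Counter(y_pred)   # samples per predicted class
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    correct = sum(hits.values())

    # Multiclass MCC written in terms of class counts.
    num = correct * n - sum(true_counts[k] * pred_counts[k] for k in labels)
    den = math.sqrt(
        (n ** 2 - sum(pred_counts[k] ** 2 for k in labels))
        * (n ** 2 - sum(true_counts[k] ** 2 for k in labels))
    )
    mcc = num / den if den else 0.0

    f1_scores, recalls = [], []
    for k in labels:
        precision = hits[k] / pred_counts[k] if pred_counts[k] else 0.0
        recall = hits[k] / true_counts[k] if true_counts[k] else 0.0
        total = precision + recall
        f1_scores.append(2 * precision * recall / total if total else 0.0)
        if true_counts[k]:
            recalls.append(recall)  # balanced accuracy averages per-class recall

    return {
        "accuracy": correct / n,
        "balanced_accuracy": sum(recalls) / len(recalls),
        "macro_f1": sum(f1_scores) / len(f1_scores),
        "mcc": mcc,
    }

# Toy labels (assumed), with one "cubic" sample misclassified.
y_true = ["cubic", "cubic", "other", "other"]
y_pred = ["cubic", "other", "other", "other"]
metrics = classification_metrics(y_true, y_pred)
```

On imbalanced data, plain accuracy can look high while minority classes are poorly predicted. MCC and balanced accuracy are less sensitive to that effect, which is consistent with the abstract reporting them alongside accuracy.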

Paper Structure (21 sections, 5 equations, 7 figures, 11 tables)

Figures (7)

  • Figure 1: ML pipeline for crystallographic classification, from data collection and preprocessing to feature augmentation and model training, predicting crystal system, point group, and space group.
  • Figure 2: Distribution of perovskite composition types (Oxide, Halide, Mixed, and Other) across the top 15 symmetry classes.
  • Figure 3: Data distribution of crystal system, point group, and space group: (a) Histogram plot of the crystal system with a pie chart on the left showing the percentage contribution of each class; (b) Histogram plot of the point group, where the dashed line separates the classes chosen from the entire dataset. Classes up to point group 2, which has 23 values, were selected; (c) Histogram plot of the space group, where the initial 15 classes up to space group 'Pc', with a value count of 30, were considered for ML model development.
  • Figure 4: Data preprocessing techniques for XRD patterns. The green line represents the original XRD data and the red line represents the augmented data. (a) Interpolation: Comparison of original XRD values with interpolated values. (b) Jittering: Visualization of original and jittered XRD values. (c) Scaling: Comparison of original and scaled XRD values. (d) Shifting: Visualization of original and shifted XRD values. Each method demonstrates the effect of data augmentation on XRD patterns, preserving the overall intensity and peak positions.
  • Figure 5: Performance metrics of the classification model for crystal system prediction using XRD data. Metrics include Precision, Recall, F1 Score, Matthews Correlation Coefficient (MCC), and Binary Accuracy for each crystal system. X-axis classes are sorted by sample ratio, with percentages shown in each label.
  • ...and 2 more figures
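
The four operations named in the Figure 4 caption (interpolation, jittering, scaling, shifting) can be sketched on a toy pattern as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function names, the Gaussian noise model, the zero-padding at the edges, and all parameter values are hypothetical.

```python
import random

def interpolate(two_theta, intensity, grid):
    """Linearly interpolate a pattern onto a common, ascending 2-theta grid.
    Assumes two_theta is strictly increasing; values outside it are clamped."""
    out, j = [], 0
    for x in grid:
        if x <= two_theta[0]:
            out.append(intensity[0])
        elif x >= two_theta[-1]:
            out.append(intensity[-1])
        else:
            while two_theta[j + 1] < x:
                j += 1
            x0, x1 = two_theta[j], two_theta[j + 1]
            y0, y1 = intensity[j], intensity[j + 1]
            out.append(y0 + (y1 - y0) * (x - x0) / (x1 - x0))
    return out

def jitter(intensity, sigma=0.01, seed=0):
    """Add small Gaussian noise to each point, clipping at zero."""
    rng = random.Random(seed)
    return [max(0.0, y + rng.gauss(0.0, sigma)) for y in intensity]

def scale(intensity, factor):
    """Multiply every intensity by a constant factor."""
    return [y * factor for y in intensity]

def shift(intensity, steps):
    """Shift the pattern by a whole number of grid steps, padding with zeros."""
    n = len(intensity)
    if steps >= 0:
        return [0.0] * min(steps, n) + intensity[: max(n - steps, 0)]
    return intensity[-steps:] + [0.0] * min(-steps, n)

# Toy pattern (assumed): a single peak at 2-theta = 20 degrees.
grid = [10, 15, 20, 25, 30]
pattern = interpolate([10, 20, 30], [0.0, 10.0, 0.0], grid)
jittered = jitter(pattern, sigma=0.1)
scaled = scale(pattern, 0.5)
shifted = shift(pattern, 1)
```

Interpolating every pattern onto one shared grid gives each sample the same length, which sequence models such as a Time Series Forest require. The shift and noise magnitudes would need to stay small if peak positions are to be preserved, as the caption describes.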