Machine Learning-Driven Crystal System Prediction for Perovskites Using Augmented X-ray Diffraction Data
Ansu Mathew, Ahmer A. B. Baloch, Alamin Yakasai, Hemant Mittal, Vivian Alberts, Jayakumar V. Karunamurthy
TL;DR
This work addresses rapid and automated symmetry classification of perovskite XRD patterns by treating XRD spectra as sequential data and applying a Time Series Forest with augmentation (e.g., SMOTE and jittering). Using a curated dataset from the Materials Project with crystallography-derived labels, the authors demonstrate high accuracy across crystal system, point group, and space group predictions, particularly for high-symmetry classes, while also analyzing compositional trends that influence symmetry distributions. The approach achieves strong performance under imbalanced conditions with data-efficient models, offering a scalable pipeline for high-throughput materials discovery and autonomous experimentation. The study provides a reproducible framework, including data provenance and a clear pathway for extending to multi-modal inputs and experimental XRD data.
Abstract
Predicting the crystal system from X-ray diffraction (XRD) spectra is a critical task in materials science, particularly for perovskite materials, which are known for their diverse applications in photovoltaics, optoelectronics, and catalysis. In this study, we present a machine learning (ML)-driven framework that uses Time Series Forest (TSF), Random Forest (RF), Extreme Gradient Boosting (XGBoost), Recurrent Neural Network (RNN), Long Short-Term Memory (LSTM), Gated Recurrent Unit (GRU), and a simple feedforward neural network (NN) models to classify crystal systems, point groups, and space groups from XRD data of perovskite materials. To address class imbalance and enhance model robustness, we integrated strategies such as the Synthetic Minority Over-sampling Technique (SMOTE), class weighting, jittering, and spectrum shifting, along with efficient data preprocessing pipelines. The TSF model with SMOTE augmentation achieved strong performance for crystal system prediction, with a Matthews correlation coefficient (MCC) of 0.9, an F1 score of 0.92, and an accuracy of 97.76%. For point group and space group prediction, balanced accuracies above 95% were obtained. The model performed especially well on symmetry-distinct classes, including cubic crystal systems, point groups 3m and m-3m, and space groups Pnma and Pnnn. This work highlights the potential of ML for XRD-based structural characterization and accelerated discovery of perovskite materials.
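The jittering and spectrum-shifting augmentations named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `jitter` and `shift`, the noise scale `sigma`, the zero-padding at the pattern edge, and the toy pattern are all assumptions made for illustration.

```python
import numpy as np

def jitter(spectrum, sigma=0.01, rng=None):
    """Add zero-mean Gaussian noise scaled to the peak intensity (assumed scheme)."""
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.normal(0.0, sigma * np.max(spectrum), size=spectrum.shape)
    # Diffraction intensities cannot be negative, so clip after adding noise.
    return np.clip(spectrum + noise, 0.0, None)

def shift(spectrum, n_bins):
    """Shift the pattern along the 2-theta axis by n_bins, zero-padding the vacated end."""
    shifted = np.zeros_like(spectrum)
    if n_bins > 0:
        shifted[n_bins:] = spectrum[:-n_bins]
    elif n_bins < 0:
        shifted[:n_bins] = spectrum[-n_bins:]
    else:
        shifted[:] = spectrum
    return shifted

# Toy pattern with a single peak (hypothetical data, not from the paper).
pattern = np.array([0.0, 1.0, 5.0, 1.0, 0.0])
augmented = [
    jitter(pattern, sigma=0.02, rng=np.random.default_rng(0)),
    shift(pattern, 1),
    shift(pattern, -1),
]
```

Each augmented copy keeps the label of the pattern it was derived from, so minority symmetry classes can be enlarged without collecting new structures. Oversampling methods such as SMOTE instead interpolate between existing minority-class samples.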
