Enhancing Dimensionality Prediction in Hybrid Metal Halides via Feature Engineering and Class-Imbalance Mitigation

Mariia Karabin, Isaac Armstrong, Leo Beck, Paulina Apanel, Markus Eisenbach, David B. Mitzi, Hanna Terletska, Hendrik Heinz

TL;DR

This work tackles the challenge of predicting the structural dimensionality of hybrid metal halides (HMHs) from small, imbalanced datasets. The authors introduce interaction-based feature engineering to capture nonlinear chemical relationships and apply SMOTE oversampling to balance the classes, all within an ensemble stacking framework. The approach yields substantial gains for the underrepresented 0D and 1D classes, with robust cross-validation performance and high per-class ROC-AUC. The methods are framed as transferable to other small-data materials problems and offer interpretable insights into the chemical factors governing dimensionality, such as hydrogen bonding and steric effects.
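To make "interaction-based feature engineering" concrete, here is a minimal sketch that extends a descriptor set with pairwise products, one common way to expose nonlinear relationships to a model. The descriptor names and values are hypothetical, chosen only for illustration; the summary above does not specify which interaction terms the authors construct.

```python
from itertools import combinations


def add_interaction_features(row):
    """Return a copy of `row` extended with all pairwise product features."""
    out = dict(row)
    # Sort the keys so the generated feature names are deterministic.
    for a, b in combinations(sorted(row), 2):
        out[f"{a}*{b}"] = row[a] * row[b]
    return out


# Hypothetical descriptors for one structure (not taken from the paper).
sample = {"n_hbond_donors": 2.0, "cation_volume": 1.5, "halide_radius": 2.2}
features = add_interaction_features(sample)

for name, value in features.items():
    print(f"{name}: {value:.2f}")
```

Three base descriptors yield three pairwise products, so the feature count grows quadratically with the number of descriptors. That growth is one reason an interaction step is usually paired with feature selection.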

Abstract

We present a machine learning framework for predicting the structural dimensionality of hybrid metal halides (HMHs), including organic-inorganic perovskites, using a combination of chemically informed feature engineering and advanced class-imbalance handling techniques. The dataset, consisting of 494 HMH structures, is highly imbalanced across dimensionality classes (0D, 1D, 2D, 3D), posing significant challenges to predictive modeling. The dataset was augmented to 1336 samples via the Synthetic Minority Oversampling Technique (SMOTE) to mitigate the effects of the class imbalance. We developed interaction-based descriptors and integrated them into a multi-stage workflow that combines feature selection, model stacking, and performance optimization to improve dimensionality prediction accuracy. Our approach significantly improves F1-scores for underrepresented classes, achieving robust cross-validation performance across all dimensionalities.
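The core idea of SMOTE is to create synthetic minority samples by interpolating between a real sample and one of its nearest same-class neighbours. The sketch below is a simplified, standard-library-only illustration of that idea, not the authors' implementation; the function name, the toy data, and the choice of `k=3` are assumptions. Note that 1336 augmented samples would correspond to 334 per class if the four classes are exactly balanced, as the Figure 4 caption indicates.

```python
import random
from collections import Counter


def smote_like_oversample(X, y, k=3, seed=0):
    """Balance classes by interpolating between same-class neighbours.

    Simplified SMOTE-style sketch: every class is raised to the size of
    the largest class. Original samples are kept unchanged.
    """
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        members = [x for x, lab in zip(X, y) if lab == label]
        if len(members) < 2:
            continue  # interpolation needs at least two points
        for _ in range(target - n):
            base = rng.choice(members)
            # Rank the other class members by squared distance to `base`.
            others = [m for m in members if m is not base]
            others.sort(key=lambda m: sum((a - b) ** 2 for a, b in zip(base, m)))
            neighbour = rng.choice(others[:k])
            t = rng.random()  # position along the segment base -> neighbour
            X_out.append([a + t * (b - a) for a, b in zip(base, neighbour)])
            y_out.append(label)
    return X_out, y_out


# Toy two-feature data with a dominant "2D" class (values are made up).
X = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [0.3, 0.3], [0.2, 0.4], [0.4, 0.2],
     [2.0, 2.0], [2.2, 2.1],
     [4.0, 0.0], [4.1, 0.3], [4.3, 0.1]]
y = ["2D"] * 6 + ["0D"] * 2 + ["1D"] * 3

X_bal, y_bal = smote_like_oversample(X, y)
class_counts = Counter(y_bal)
print(dict(class_counts))
```

Because each synthetic point lies on a line segment between two real samples of the same class, the augmented data stays inside the region already occupied by that class. In practice, oversampling is typically applied only to the training folds, so that synthetic points do not leak into validation data.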

Paper Structure

This paper contains 14 sections, 1 equation, 8 figures, 3 tables.

Figures (8)

  • Figure 1: (a) Classic $ABX_3$ perovskite crystal structure illustrating the general arrangement; (b) examples of structural differences between hybrid metal halide dimensionality classes.
  • Figure 2: Data distribution of the structural dimensionality of hybrid metal halides (HMHs).
  • Figure 3: Feature importance analysis of all descriptors for HMH dimensionality prediction.
  • Figure 4: Per-class data distribution before and after SMOTE-based oversampling. The original dataset (blue) shows a strong imbalance favoring 2D structures. After the data augmentation (orange), all four dimensionality classes contain an equal number of samples, enabling better model generalization.
  • Figure 5: Multi-class ROC curves (one-vs-rest) before SMOTE augmentation. Performance on minority class 0 (0D HMHs) is low (AUC = 0.54), indicating poor classification due to class imbalance.
  • ...and 3 more figures