Leveraging Composition-Based Material Descriptors for Machine Learning Optimization

Giovanni Trezza; Eliodoro Chiavazzo

TL;DR

This work tackles descriptor efficiency and predictive power in ML-guided discovery of low-$T_{\rm{c}}$ superconductors by (i) building a large, curated dataset from SuperCon and annotating each compound with 81-145 composition-based descriptors, (ii) using SHAP to identify key features and attempting invariant-group analyses via a DNN gradient, (iii) introducing a maximum-entropy QEG classifier and comparing it to standard classifiers, and (iv) developing a multi-objective, Pareto-front approach to construct mixed features that enhance class separation. The results show no robust invariants in binary feature groups, while mixed-feature representations and entropy-based Bayesian classifiers improve classification performance; Extra Trees models with broader feature sets often outperform entropy-based approaches, and SMOTE further stabilizes performance on imbalanced data. Collectively, the paper provides a generalizable framework for descriptor reduction and classifier design that can accelerate discovery in energy materials and beyond, with open access to data and code. The methods emphasize principled feature engineering (SHAP-guided ranking, dimensionless mixing, Pareto optimization) and probabilistic classification that can be adapted to other material-property targets.
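As a rough illustration of the Pareto-front idea in item (iv), the sketch below keeps only the non-dominated candidates when every objective is to be minimized. The function name `pareto_mask`, the two objective columns, and the numbers are hypothetical placeholders; the objectives actually used in the paper are not stated in this summary.

```python
import numpy as np

def pareto_mask(costs):
    """Return a boolean mask of non-dominated rows (all objectives minimized)."""
    costs = np.asarray(costs, dtype=float)
    mask = np.ones(costs.shape[0], dtype=bool)
    for i in range(costs.shape[0]):
        # A row dominates row i if it is no worse in every objective
        # and strictly better in at least one.
        no_worse = np.all(costs <= costs[i], axis=1)
        better = np.any(costs < costs[i], axis=1)
        if np.any(no_worse & better):
            mask[i] = False
    return mask

# Hypothetical candidates: one row per mixed feature,
# columns are two made-up objectives to minimize.
candidates = np.array([
    [0.10, 0.90],
    [0.20, 0.40],
    [0.30, 0.50],  # dominated by [0.20, 0.40]
    [0.60, 0.10],
    [0.70, 0.20],  # dominated by [0.60, 0.10]
])
front = pareto_mask(candidates)
print(candidates[front])
```

The quadratic scan is adequate for a few thousand candidates; larger searches would likely call for a sort-based filter.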

Abstract

In this study, we evaluate several classifiers and focus on selecting a minimal set of appropriate material features. Our objective is to propose and discuss general strategies for reducing the number of descriptors required for material classification. The first strategy tests whether the target material property, here the critical temperature, is invariant with respect to binary groups of composition-based features. The second is a multi-objective optimization procedure for reducing the set of composition-based material descriptors, which is found to be particularly useful when applied to Bayesian classifiers. We test the proposed strategies on low-temperature superconductor data extracted from a public database.
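One way to make the invariance test concrete, assuming the binary group has the power-law form $x_i^a x_j^b$ given in the Figure 1 caption: if $T_{\rm{c}}$ depends on the pair $(x_i, x_j)$ only through that group, the ratio of its logarithmic sensitivities is a constant. The symbols $F$ and $g$ are notation introduced here for illustration, and the result assumes $x_i, x_j > 0$, $b \neq 0$, and $\partial F/\partial g \neq 0$.

```latex
T_{\rm c} = F\!\left(g, \ldots\right), \qquad g = x_i^{a} x_j^{b}
\;\Longrightarrow\;
\frac{\partial T_{\rm c}}{\partial \ln x_i} = a\, g\, \frac{\partial F}{\partial g},
\qquad
\frac{\partial T_{\rm c}}{\partial \ln x_j} = b\, g\, \frac{\partial F}{\partial g}
\;\Longrightarrow\;
\frac{\partial T_{\rm c} / \partial \ln x_i}{\partial T_{\rm c} / \partial \ln x_j} = \frac{a}{b}.
```

In practice the two sensitivities could be read off the gradient of a differentiable regressor, and a ratio that varies appreciably across the dataset would argue against the existence of such a group.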

Paper Structure (16 sections, 12 equations, 11 figures, 3 tables)

Introduction
Methods
Dataset creation
Regression models and descriptors choice
Invariant groups identification procedure
QEG-based probabilistic classifier
Results and discussion
Models for predicting the critical temperature value
Invariant groups
Entropy-based binary classifiers
Other standard binary classifiers
Optimal reduction of the composition-based material descriptors
Application to one- and two-dimensional cases
Possible generalizations
Entropy-, tree-, and Bayes-based binary classifiers on the new mixed features
...and 1 more sections

Figures (11)

Figure 1: Overview of the protocol used to find a reduced set of ruling descriptors for conventional superconductivity and to construct optimized mixed features. Over 7000 chemical compositions have been featurized with 145 descriptors. A regression model has been trained and validated on this dataset. During the pre-processing routines (i.e., feature reduction by means of linear correlation analysis, descriptor variance analysis, and correlation analysis with $T_{\rm{c}}$; see Supplementary Note 4 for details), many of those features are discarded, leaving 81 descriptors. These 81 features are then ranked by importance using SHAP. The work aims at finding optimized mixed features of two kinds: features of the form $x_i^a x_j^b$, for both regression and classification, and power or linear combinations of the primitive features, for classification only. The latter descriptors have been tested with both the new entropy-based classifiers and other classifiers.
Figure 2: Predictions and corresponding normalized cumulative curve for the coefficients of importance of the ETR model. Model performances are shown in terms of coefficient of determination $R^2$, mean absolute error (MAE), and root mean squared error (RMSE), with the size of training and testing sets $N_{\rm{train}}$ and $N_{\rm{test}}$, respectively.
Figure 3: The five most important features according to the SHAP ranking for $T_{\rm{c}}$. For each feature (i.e., each line), 1084 dots are shown, representing the entire testing set used for computing the related SHAP values (impacts on the model output, horizontal axis). The color represents the corresponding feature value, and the features are sorted by the mean of the absolute SHAP values.
Figure 4: Predictions over the testing set and corresponding loss curves for the DNN regression model. Model performances are shown in terms of coefficient of determination $R^2$, mean absolute error (MAE), and root mean squared error (RMSE), with the sizes of the training, the validation and the testing sets, $N_{\textrm{train}}$, $N_{\textrm{val}}$, $N_{\textrm{test}}$ respectively.
Figure 5: Probabilistic classifier. (a) Two-dimensional binning, with 10 bins for the first variable and 5 bins for the second, of the two most relevant features $x_1, x_2$ according to the SHAP ranking, for superconductors with $T_{\rm{c}}<15\,\mathrm{K}$ and $T_{\rm{c}}\geq15\,\mathrm{K}$ respectively, within the training set (85% of the materials); (b) QEG solution of the corresponding maximum Shannon entropy probability distribution; (c) QEG solution of the corresponding maximum Shannon entropy probability distribution, bagged case.
...and 6 more figures
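
The 10-by-5 binning described in the Figure 5 caption can be sketched as follows. This is a plain frequency estimate of the class probability in each bin, not the QEG maximum-entropy solution of panels (b) and (c); the function name `binned_posterior`, the synthetic data, and the label convention (class 1 standing in for $T_{\rm{c}}\geq15\,\mathrm{K}$) are assumptions made for illustration.

```python
import numpy as np

def binned_posterior(x1, x2, labels, bins=(10, 5)):
    """Per-bin empirical P(class 1 | bin) from 2-D histograms of two features."""
    x1, x2, labels = map(np.asarray, (x1, x2, labels))
    # Shared bin edges so that both classes are counted on the same grid.
    edges1 = np.linspace(x1.min(), x1.max(), bins[0] + 1)
    edges2 = np.linspace(x2.min(), x2.max(), bins[1] + 1)
    counts = []
    for c in (0, 1):
        h, _, _ = np.histogram2d(x1[labels == c], x2[labels == c],
                                 bins=[edges1, edges2])
        counts.append(h)
    total = counts[0] + counts[1]
    # Empty bins carry no evidence; mark them NaN instead of dividing by zero.
    with np.errstate(invalid="ignore", divide="ignore"):
        posterior = np.where(total > 0, counts[1] / total, np.nan)
    return posterior, (edges1, edges2)

# Synthetic stand-in data (not from SuperCon): two features and a binary label.
rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
labels = (x1 + 0.5 * x2 + 0.3 * rng.normal(size=n) > 0).astype(int)

posterior, edges = binned_posterior(x1, x2, labels)
print(posterior.shape)
```

Sparse bins make the raw frequencies noisy, which is one motivation for replacing them with a smoother estimate such as a maximum-entropy distribution or a bagged average.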
