Leveraging Composition-Based Material Descriptors for Machine Learning Optimization
Giovanni Trezza, Eliodoro Chiavazzo
TL;DR
This work tackles descriptor efficiency and predictive power in ML-guided discovery of low-$T_{\mathrm{c}}$ superconductors by (i) building a large, curated dataset from SuperCon and annotating each compound with 81-145 composition-based descriptors, (ii) using SHAP to identify key features and attempting invariant-group analyses via a DNN gradient, (iii) introducing a maximum-entropy QEG classifier and comparing it to standard classifiers, and (iv) developing a multi-objective, Pareto-front approach to construct mixed features that enhance class separation. The results show no robust invariants in binary feature groups, while mixed-feature representations and entropy-based Bayesian classifiers improve classification performance; Extra Trees models with broader feature sets often outperform entropy-based approaches, and SMOTE further stabilizes performance on imbalanced data. Collectively, the paper provides a generalizable framework for descriptor reduction and classifier design that can accelerate discovery in energy materials and beyond, with open access to data and code. The methods emphasize principled feature engineering (SHAP-guided ranking, dimensionless mixing, Pareto optimization) and probabilistic classification that can be adapted to other material-property targets.
Abstract
In this study, we evaluate several classifiers and focus on selecting a minimal set of appropriate material features. Our objective is to propose and discuss general strategies for reducing the number of descriptors required for material classification. The first strategy involves testing whether the target material property, here the critical temperature, is invariant with respect to binary groups of composition-based features. We also propose a multi-objective optimization procedure to reduce the set of composition-based material descriptors. The latter procedure is found to be particularly useful when applied to Bayesian classifiers. We test the proposed strategies on low-temperature superconductor data extracted from a public database.
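As a rough illustration of the Pareto-front idea behind a multi-objective descriptor reduction, the sketch below filters a set of candidates down to the non-dominated ones. The two objectives (number of descriptors and classification error, both minimized), the candidate values, and the function names `dominates` and `pareto_front` are assumptions made for this example; they are not taken from the paper, whose actual objectives and optimizer may differ.

```python
from typing import List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """Return True if `a` is no worse than `b` in every objective and
    strictly better in at least one (all objectives are minimized)."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def pareto_front(points: List[Sequence[float]]) -> List[int]:
    """Return the indices of the non-dominated points."""
    return [
        i
        for i, p in enumerate(points)
        if not any(dominates(q, p) for j, q in enumerate(points) if j != i)
    ]


# Hypothetical candidates: (number of descriptors, classification error).
candidates = [(2, 0.30), (3, 0.22), (3, 0.25), (5, 0.20), (6, 0.21), (8, 0.20)]

front = pareto_front(candidates)
for i in front:
    print(candidates[i])
```

Each point kept on the front represents a different trade-off between a smaller descriptor set and a lower error; choosing one of them is a separate decision that this sketch does not address.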
