Accurate predictive model of band gap with selected important features based on explainable machine learning

Joohwi Lee, Kaito Miyamoto

TL;DR

Explainable ML (XML) clarifies the role of each feature, which makes it possible to build simplified yet highly accurate machine learning models, reducing the computational cost of feature acquisition and enhancing model trustworthiness for materials discovery.

Abstract

In the rapidly advancing field of materials informatics, nonlinear machine learning models have demonstrated exceptional predictive capabilities for material properties. However, their black-box nature limits interpretability, and they may incorporate features that do not contribute to, or even deteriorate, model performance. This study applies explainable ML (XML) techniques, namely permutation feature importance (PFI) and SHapley Additive exPlanations (SHAP), to a pristine support vector regression model designed to predict band gaps at the GW level using 18 input features. Guided by the XML-derived importance of individual features, a simple framework is proposed to construct reduced-feature predictive models. Model evaluations indicate that an XML-guided compact model, consisting of the top five features, achieves comparable accuracy to the pristine model on in-domain datasets (0.254 vs. 0.247 eV) while generalizing better, with lower prediction errors on out-of-domain data (0.461 eV for the pristine model vs. 0.341 eV for the compact model). Additionally, the study underscores the need to eliminate strongly correlated features (correlation coefficient greater than 0.8) before applying XML, to prevent misinterpretation and overestimation of feature importance. This study highlights XML's effectiveness in developing simplified yet highly accurate machine learning models by clarifying feature roles, thereby reducing computational costs for feature acquisition and enhancing model trustworthiness for materials discovery.
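The correlated-feature filtering mentioned in the abstract can be illustrated with a minimal sketch. It assumes the Pearson coefficient and a greedy, order-dependent rule for deciding which member of a correlated pair is kept; the abstract states only the 0.8 threshold, so both choices, along with the function names and the toy data, are hypothetical and not the authors' procedure.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def drop_correlated(features, threshold=0.8):
    """Keep a feature only if |r| with every already-kept feature is <= threshold.

    `features` maps a feature name to its list of values. Insertion order
    decides which member of a correlated pair survives (an assumption of
    this sketch, not something stated in the source).
    """
    kept = []
    for name, values in features.items():
        if all(abs(pearson(values, features[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical toy data: f_b is almost a multiple of f_a, f_c is not.
toy = {
    "f_a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "f_b": [2.1, 3.9, 6.2, 8.0, 9.9],
    "f_c": [5.0, 1.0, 4.0, 2.0, 3.0],
}
kept_features = drop_correlated(toy)
print(kept_features)
```

In this toy case `f_b` is dropped because it is almost perfectly correlated with `f_a`, while `f_c` is retained. In practice the choice of which correlated feature to keep would more likely be guided by domain knowledge or acquisition cost than by column order.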

Paper Structure

This paper contains 18 sections, 5 equations, 14 figures, 3 tables.

Figures (14)

  • Figure 1: XML importance scores of the SVR model for predicting $E_\textrm{g}^{\textrm{GW}}$ using the 11-feature set. For each feature, the paired bars represent the PFI (left bar) and SHAP importance (right bar). The PFI score is calculated as the increase in the RMSE on the test dataset when the values of a specific feature are shuffled, with the predictive model trained on the training dataset. The SHAP importance score is calculated as the mean of the absolute values of the individual SHAP values from the test dataset. Bars are colored red or blue when the sign of $r_p$ between the SHAP values and the predicted values is consistently positive or negative, respectively, across the predictive models built from 20 different data selections; hatched bars indicate that the sign varies across the predictive models. The error bars represent one standard deviation of the XML importance scores across the predictive models constructed using 20 different data selections. The features are displayed in descending order based on the average PFI and SHAP importance scores.
  • Figure 2: SVR models for $E_\textrm{g}^\textrm{GW}$ prediction using various feature sets. (a) Dependence of the RMSE for the test in-domain dataset (cyan rectangles) and generalization gap (orange $\times$, right vertical axis) on the number of features selected based on XML importance scores. (b) Dependence of the RMSE for the OOD dataset (cyan rectangles) and the predicted value deviations (orange $\times$, right vertical axis) on the number of features selected based on the XML importance scores using 20 different data selections. The values at $n_x$ = 18 correspond to the pristine model. Error bars indicate one standard deviation across the predictive models using 20 different data selections. For $n_x$ = 3 to 10, 10 predictive models with random feature sets, including $E_\textrm{g}^\textrm{PBE}$, were constructed, represented by empty circles. In addition, models with low RMSE values for the in-domain test dataset (<0.30 eV) are represented by green circles in panels (a) and (b). Values exceeding the range of the vertical axis are not displayed. LASSO results are also shown for comparison; here the horizontal axis indicates the number of input features, and the feature order for $n_x$ = 2 to 11 is determined by ranking the coefficients in the 11-feature LASSO model by their absolute magnitude. Parity plots for the 30 OOD data points: (c) pristine model with the 18-feature set and (d) predictive model with the 5-feature set. Each dot represents the predicted values from the predictive models with 20 different data selections. The parity plots for the models with all the other feature sets are displayed in Supplementary Fig. \ref{fig:allfeatures-deviations}.
  • Figure 3: (a) Top five features by SHAP importance score for the SVR model for $E_\textrm{g}^\textrm{GW}$ prediction using the 18-feature set. The SHAP importance scores for all 18 features are provided in Supplementary Fig. \ref{fig:shap17}. The plotting conventions, such as the error bars and bar colors, are explained in Fig. \ref{fig:svrpfishap}. Relationships between (b) $\sigma(Z)$ and $\sigma(m)$, and (c) SHAP values for $\sigma(Z)$ and $\sigma(m)$ of the test dataset.
  • Figure S1: Comparison of in-domain (270 compounds composed of binary and ternary systems) and out-of-domain (OOD) datasets (30 compounds containing transition metals and/or quaternary/pentanary systems) for 18 material features. Each subplot shows the probability density distributions of the corresponding feature for both datasets. The Kolmogorov–Smirnov test is used to determine whether the two distributions originate from the same population. The $p$-value is shown in each panel; values smaller than 0.01, shown in red, indicate that the in-domain and OOD distributions are significantly different at the 99% confidence level.
  • Figure S2: Correlation coefficients for relationship between $E_\textrm{g}^{\textrm{GW}}$ and 18 features for 270 binary and ternary inorganic compounds. The left-lower and right-upper triangles represent $r_p$ and $r_s$, respectively.
  • ...and 9 more figures
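
The PFI definition in the Figure 1 caption (the increase in test-set RMSE when one feature's values are shuffled) can be sketched in plain Python as follows. This is a minimal illustration, not the authors' code: `toy_predict` is a hypothetical stand-in for a trained SVR model, the data are synthetic, and averaging over `n_repeats` shuffles is an assumption of the sketch (the caption's 20 data selections refer to different train/test splits, which are not reproduced here).

```python
import math
import random


def rmse(y_true, y_pred):
    """Root-mean-square error between targets and predictions."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Mean increase in RMSE when a single feature column is shuffled.

    predict: callable mapping a list of feature rows to a list of predictions
    X: list of rows (each a list of feature values); y: list of targets
    """
    rng = random.Random(seed)
    baseline = rmse(y, predict(X))
    scores = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between feature j and the target
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            increases.append(rmse(y, predict(X_perm)) - baseline)
        scores.append(sum(increases) / n_repeats)
    return scores


def toy_predict(X):
    """Hypothetical 'trained model' that uses feature 0 and ignores feature 1."""
    return [3.0 * row[0] + 0.0 * row[1] for row in X]


# Synthetic test set: the target depends on feature 0 only.
X_test = [[float(i), float((7 * i) % 5)] for i in range(10)]
y_test = [3.0 * row[0] for row in X_test]

pfi = permutation_importance(toy_predict, X_test, y_test)
print(pfi)
```

Shuffling the ignored feature leaves the predictions unchanged, so its score is zero, while shuffling the informative feature raises the RMSE. The SHAP importance described in the same caption is the mean of the absolute per-sample SHAP values and would require a SHAP implementation that is not sketched here.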