
Explainable Data-driven Modeling of Adsorption Energy in Heterogeneous Catalysis

Tirtha Vinchurkar, Janghoon Ock, Amir Barati Farimani

TL;DR

The paper tackles interpretability in data-driven adsorption-energy modeling for heterogeneous catalysis by combining two explainable AI strategies: post-hoc SHAP analysis of shallow ML models and symbolic regression via PySR and SISSO++. Using OC20 data, it identifies adsorbate properties as the dominant descriptors and reveals a robust link between adsorption energy and catalyst surface characteristics, including a direct proportionality to the effective coordination number ($E_{ads} \propto CN_{cat}$) and a squared dependence on catalyst electronegativity ($E_{ads} \propto \chi_{cat}^2$) in certain regimes. SHAP ranks the top features as adsorbate electronegativity ($\chi_{ads}$), number of adsorbate atoms ($N_{ads}$), catalyst electronegativity ($\chi_{cat}$), effective coordination number ($CN_{cat}$), and the sum of adsorbate atomic numbers ($\sum Z_{ads}$), with a positive $\chi_{cat}$–$E_{ads}$ correlation. Symbolic regression provides explicit, physics-aligned equations that corroborate these relationships. The framework shows how explainability can guide catalyst design and high-throughput screening by revealing meaningful descriptor interactions and providing interpretable predictive expressions.
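To make the idea of symbolic regression concrete, the sketch below searches a tiny, hand-picked set of candidate forms and keeps the one with the lowest error. It is a toy illustration only, not the PySR or SISSO++ procedure: the candidate set, the function names (`fit_scale`, `best_form`), and the synthetic data are all assumptions introduced here, and real tools explore a far larger expression space.

```python
import math

# Hypothetical candidate forms chosen for illustration only.
CANDIDATES = {
    "x": lambda x: x,
    "x^2": lambda x: x * x,
    "sqrt(x)": lambda x: math.sqrt(x),
    "1/x": lambda x: 1.0 / x,
}


def fit_scale(g, y):
    """Least-squares coefficient a for the model y ~ a * g (no intercept)."""
    num = sum(gi * yi for gi, yi in zip(g, y))
    den = sum(gi * gi for gi in g)
    return num / den


def best_form(xs, ys):
    """Return (name, a, mse) for the candidate with the lowest mean squared error."""
    results = []
    for name, f in CANDIDATES.items():
        g = [f(x) for x in xs]
        a = fit_scale(g, ys)
        mse = sum((a * gi - yi) ** 2 for gi, yi in zip(g, ys)) / len(ys)
        results.append((mse, name, a))
    mse, name, a = min(results)
    return name, a, mse


# Synthetic data (not OC20 values): y = 0.5 * x^2, standing in for
# an E_ads that scales with the square of chi_cat.
xs = [1.0, 1.5, 2.0, 2.5, 3.0]
ys = [0.5 * x * x for x in xs]
name, a, mse = best_form(xs, ys)
print(name, round(a, 3))
```

On this synthetic input the search should recover the squared form, which mirrors the kind of explicit expression the TL;DR attributes to symbolic regression.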

Abstract

The growing use of machine learning (ML) in catalysis has spurred interest in applying these techniques to catalyst design. Our study aims to bridge the gap between physics-based studies and data-driven methodologies by integrating ML techniques with eXplainable AI (XAI). Specifically, we employ two XAI techniques, post-hoc XAI analysis and symbolic regression, to unravel the correlation between adsorption energy and the properties of the adsorbate-catalyst system. Using the large Open Catalyst Dataset (OC20), we train multiple shallow ML models to predict adsorption energy, then apply post-hoc analysis to examine feature importance, inter-feature correlations, and the influence of feature values on the predicted adsorption energy. The post-hoc analysis reveals that adsorbate properties exert a greater influence than catalyst properties in our dataset. The five features with the highest Shapley values are adsorbate electronegativity, the number of adsorbate atoms, catalyst electronegativity, effective coordination number, and the sum of atomic numbers of the adsorbate molecule. Both catalyst and adsorbate electronegativity correlate positively with the predicted adsorption energy. Symbolic regression yields results consistent with the SHAP analysis: it deduces a mathematical relationship in which the adsorption energy is directly proportional to the square of the catalyst electronegativity. These correlations resemble those derived from physics-based equations in previous research. Our work establishes a robust framework that integrates ML techniques with XAI, leveraging large datasets like OC20 to enhance catalyst design through model explainability.
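The Shapley values behind the feature ranking can be illustrated with an exact, brute-force computation on a small model. This is a minimal sketch under stated assumptions: the toy model, its coefficients, the feature values, and the baseline are all made up for illustration and are not fitted to OC20. The SHAP library typically estimates these values with model-specific or sampling-based algorithms rather than the full enumeration shown here.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values for one sample x relative to a baseline sample.

    Features outside a coalition are set to their baseline value.
    The cost grows as 2^n, so this only suits a handful of features.
    """
    n = len(x)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Weight of a coalition of this size in the Shapley average.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                with_i = [x[j] if (j in subset or j == i) else baseline[j]
                          for j in range(n)]
                without_i = [x[j] if j in subset else baseline[j]
                             for j in range(n)]
                phi[i] += weight * (predict(with_i) - predict(without_i))
    return phi


def toy_model(features):
    """Hypothetical model with made-up coefficients (not from the paper)."""
    chi_ads, n_ads, chi_cat = features
    return 1.2 * chi_ads - 0.3 * n_ads + 0.4 * chi_cat ** 2


x = [3.0, 2.0, 2.0]          # illustrative [chi_ads, N_ads, chi_cat]
baseline = [2.5, 1.0, 1.5]   # illustrative reference sample
phi = shapley_values(toy_model, x, baseline)
print([round(p, 3) for p in phi])
```

The contributions sum to the difference between the prediction for `x` and the prediction for the baseline, which is the property that lets Shapley values be read as a per-feature breakdown of a single prediction.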

Paper Structure (13 sections, 3 equations, 4 figures, 3 tables)

This paper contains 13 sections, 3 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Overview of XAI Methods: Predicting adsorption energy with shallow machine learning models and symbolic regression. Feature importance is derived from shallow machine learning predictions through post-hoc SHAP analysis. Symbolic regression provides mathematical equations alongside its predictions.
  • Figure 2: Performance Evaluation and Feature Correlation: (a) Parity plot showing the performance of the best model, AdaBoost regression with a Random Forest regressor as the base estimator. MAE values are calculated for systems with H, O, and C1 group. (b) Correlation matrix showing the relationships between input features. A few feature pairs show high correlation coefficients, such as local electronegativity with catalyst electronegativity, and site type with the coordination number of the adsorbate molecule.
  • Figure 3: SHAP Analysis: (a) Radar plot illustrating feature importance based on Shapley values. (b) Summary Bar Plot presenting all features. (c) Beeswarm plot depicting the relationship between feature values and Shapley values.
  • Figure 4: Feature-Feature Correlation: (a) Scatter plot illustrating the relationship between catalyst electronegativity and local electronegativity, indicating higher values of catalyst electronegativity with increasing local electronegativity. (b) Scatter plot demonstrating the correlation between the number of adsorbate atoms and adsorbate electronegativity, revealing a tendency for adsorbate electronegativity to decrease with an increase in the number of adsorbate atoms.
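The feature-feature relationships described in the captions for Figures 2(b) and 4 can be sketched as a correlation matrix. The example below assumes the Pearson coefficient, which the captions do not specify, and uses made-up feature values rather than OC20 data; the column names are hypothetical labels chosen for illustration.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


def correlation_matrix(columns):
    """Nested dict of pairwise Pearson coefficients for named columns."""
    names = list(columns)
    return {p: {q: pearson(columns[p], columns[q]) for q in names}
            for p in names}


# Made-up values for illustration only (not taken from OC20).
features = {
    "chi_cat": [1.6, 1.9, 2.2, 2.3, 2.5],
    "chi_local": [1.7, 1.9, 2.1, 2.4, 2.5],
    "n_ads": [1, 2, 3, 4, 5],
    "chi_ads": [3.4, 3.0, 2.8, 2.7, 2.5],
}
corr = correlation_matrix(features)
for name, row in corr.items():
    print(name, {k: round(v, 2) for k, v in row.items()})
```

The toy values are chosen so that `chi_cat` and `chi_local` rise together while `chi_ads` falls as `n_ads` grows, matching the qualitative trends stated in the Figure 4 caption.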