Multimodal Language and Graph Learning of Adsorption Configuration in Catalysis

Janghoon Ock, Srivathsan Badrinarayanan, Rishikesh Magar, Akshay Antony, Amir Barati Farimani

TL;DR

This study improves a predictive language model by aligning its latent space with that of a well-established graph neural network through a self-supervised process called graph-assisted pretraining. The method reduces the mean absolute error of energy prediction for adsorption configurations by 7.4–9.8% and redirects the model’s attention toward the adsorption configuration.

Abstract

Adsorption energy is a reactivity descriptor that must be accurately predicted for effective machine learning (ML) application in catalyst screening. This process involves determining the lowest energy across various adsorption configurations on a catalytic surface, which can exhibit very similar energy values. While graph neural networks (GNNs) have shown great success in computing the energy of catalyst systems, they rely heavily on atomic spatial coordinates. In contrast, transformer-based language models can directly use human-readable text inputs, potentially bypassing the need for detailed atomic positions. However, these language models often struggle with accurately predicting the energy of adsorption configurations. Our study addresses this limitation by introducing a self-supervised multi-modal learning approach called graph-assisted pretraining, which connects well-established GNNs with emerging language model applications. This method reduces the MAE of energy prediction for adsorption configurations by about 10%. Furthermore, our findings demonstrate that graph-assisted pretraining enhances fine-tuning with different datasets, indicating the transferability of this approach. This method also redirects the model's attention toward adsorption configuration, rather than individual adsorbate and catalyst information, similar to common domain knowledge. Building on this, we propose using generative large language models to create text inputs for the predictive model, based solely on chemical composition and surface orientation, without relying on exact atomic positions. This demonstrates a potential use case of language models in energy prediction without geometric information.
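
To make the idea of aligning a text encoder's latent space with a graph encoder's more concrete, the following is a minimal sketch of a symmetric contrastive alignment loss over paired text and graph embeddings. The abstract does not state the exact objective, so the loss form, the `temperature` value, and the function names (`cosine`, `cross_entropy_row`, `alignment_loss`) are assumptions chosen for illustration, not the authors' implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cross_entropy_row(logits, target):
    """Cross-entropy of one row of logits against the target index."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]

def alignment_loss(text_emb, graph_emb, temperature=0.1):
    """Hypothetical symmetric contrastive loss (assumed form).

    text_emb[i] and graph_emb[i] describe the same adsorbate-catalyst
    system; every other pairing in the batch acts as a negative.
    """
    n = len(text_emb)
    sim = [[cosine(t, g) / temperature for g in graph_emb] for t in text_emb]
    # Text-to-graph direction: each row should peak on its diagonal entry.
    t2g = sum(cross_entropy_row(sim[i], i) for i in range(n)) / n
    # Graph-to-text direction: each column should peak on its diagonal entry.
    g2t = sum(
        cross_entropy_row([sim[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return 0.5 * (t2g + g2t)

# Toy embeddings: matched pairs give a much lower loss than swapped pairs.
text = [[1.0, 0.0], [0.0, 1.0]]
graph_matched = [[1.0, 0.0], [0.0, 1.0]]
graph_swapped = [[0.0, 1.0], [1.0, 0.0]]
loss_matched = alignment_loss(text, graph_matched)
loss_swapped = alignment_loss(text, graph_swapped)
```

In this toy case the loss is close to zero when each text embedding points in the same direction as its paired graph embedding, and large when the pairs are swapped. Minimizing such a loss pulls the two latent spaces together without using energy labels.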

Paper Structure (25 sections, 2 equations, 8 figures, 6 tables)

This paper contains 25 sections, 2 equations, 8 figures, 6 tables.

Figures (8)

  • Figure 1: Overview of the model training framework. (a) The training process consists of two steps: graph-assisted pretraining and energy prediction fine-tuning. (b) The CatBERTa model is used as the text encoder. (c) The EquiformerV2 model serves as the graph encoder, and the graph embedding from the final layer is converted to a 1D format by reshaping and max pooling the collection of atom embeddings. The architecture image is reproduced from the original EquiformerV2 paper [equiformerv2].
  • Figure 2: Model inference framework. Both structure data from the Open Catalyst datasets and CIFs generated by fine-tuned CrystaLLM can be converted into textual strings compatible with CatBERTa input, following the string conversion logic shown in the bottom right box. Generated CIFs provide structure information, including atomic positions, types, and unit cell details.
  • Figure 3: Analysis of similarity scores and sectional attention with and without graph-assisted pretraining. (a) and (b) display the similarity score analysis. (c) and (d) show the sectional attention score comparison. The left panels are without graph-assisted pretraining, while the right panels are with it. These results are derived from predictions of models trained on the OC20 dataset and evaluated using text strings from the GemNet-OC-relaxed structures.
  • Figure 4: CrystaLLM framework. (a) illustrates the fine-tuning step using the CIFs from the relaxed structures in the OC20 and OC20-Dense training datasets. (b) depicts the inference process using the provided adsorbate and catalyst pair information. (c) shows visualization examples. These ground truth systems are sourced from the OC20 validation set, matching composition and surface orientation.
  • Figure 5: Enhancement from LLM-derived strings as input for the CatBERTa model. (a) Twelve example adsorbate-catalyst pairs are sampled from the 66 pairs. Blue dots represent the energy of different adsorption configurations for each adsorbate-catalyst pair. The number of adsorption configurations for the twelve example pairs ranges from 4 to 130, with a mean value of 62.5. (b) Prediction Inclusion Ratio (PIR) for each case quantifies the improvement in prediction accuracy across 66 pairs. The term 'config.' refers to the LLM-derived configuration strings.
  • ...and 3 more figures
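
The Figure 1(c) caption describes converting per-atom graph embeddings into a single 1D vector by reshaping and max pooling. A minimal sketch of that step is given below, assuming each atom embedding is a small 2D array that is first flattened and then pooled element-wise across atoms. The function names and the toy shapes are hypothetical and do not reflect the actual EquiformerV2 tensor layout.

```python
def flatten(atom_embedding):
    """Reshape one atom's 2D embedding (list of rows) into a flat list."""
    return [value for row in atom_embedding for value in row]

def pool_graph_embedding(atom_embeddings):
    """Max-pool flattened atom embeddings into one 1D graph embedding.

    Each position of the output holds the maximum of that position
    across all atoms, so the result has a fixed length regardless of
    how many atoms the system contains.
    """
    flat = [flatten(embedding) for embedding in atom_embeddings]
    return [max(column) for column in zip(*flat)]

# Toy system with two atoms, each carrying a 2 x 2 embedding (assumed shape).
atoms = [
    [[1.0, -2.0], [0.0, 3.0]],
    [[0.0, 5.0], [-1.0, 2.0]],
]
graph_vector = pool_graph_embedding(atoms)  # [1.0, 5.0, 0.0, 3.0]
```

Because the pooled vector has the same length for any number of atoms, it can be compared directly with a fixed-size text embedding.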