GPT-MolBERTa: GPT Molecular Features Language Model for molecular property prediction

Suryanarayanan Balaji, Rishikesh Magar, Yayati Jadhav, Amir Barati Farimani

TL;DR

GPT-MolBERTa introduces a text-description-based molecular representation learned via self-supervised pretraining on ChatGPT-generated descriptions of approximately 326k molecules. The model leverages RoBERTa/BERT encoders and is fine-tuned on MoleculeNet benchmarks, achieving competitive classification and strong regression performance, approaching state-of-the-art in several tasks. Pretraining on textual molecular descriptions improves downstream predictions, and attention analyses offer interpretability by highlighting descriptive cues linked to molecular features. This approach provides a data-efficient alternative to SMILES/graph representations and points to future work with larger-scale text pretraining and contrastive learning.

Abstract

With the emergence of Transformer architectures and their powerful understanding of textual data, a new horizon has opened up for predicting molecular properties from text descriptions. While SMILES strings are the most common form of representation, they lack robustness, rich information, and canonicity, which limits their effectiveness as generalizable representations. Here, we present GPT-MolBERTa, a self-supervised large language model (LLM) which uses detailed textual descriptions of molecules to predict their properties. Text-based descriptions of 326,000 molecules were collected using ChatGPT and used to train the LLM to learn representations of molecules. To predict properties for the downstream tasks, both BERT and RoBERTa models were used in the fine-tuning stage. Experiments show that GPT-MolBERTa performs well on various molecular property benchmarks and approaches state-of-the-art performance on regression tasks. Additionally, further analysis of the attention mechanisms shows that GPT-MolBERTa is able to pick up important information from the input textual data, demonstrating the interpretability of the model.
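
The abstract describes the pretraining stage only as self-supervised. RoBERTa-style encoders are usually pretrained with masked language modeling, so the sketch below illustrates that objective on a single description. It is a simplified illustration under stated assumptions, not the authors' pipeline: the whitespace tokenizer, the 15% masking rate, the `mask_tokens` helper, and the example description are all hypothetical, and the usual 80/10/10 replacement scheme is omitted for brevity.

```python
import random

MASK_TOKEN = "<mask>"  # mask token string used by RoBERTa tokenizers


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Mask each token independently with probability mask_prob.

    Returns (masked_tokens, labels). labels[i] holds the original token
    where position i was masked and None elsewhere, so a training loss
    would only be computed at the masked positions.
    """
    rng = random.Random(seed)  # seeded for reproducibility
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels


# Hypothetical description written for illustration; not actual ChatGPT output.
description = (
    "Ethanol contains a hydroxyl functional group and has a "
    "molecular weight of 46.07 g/mol"
)
tokens = description.split()  # assumption: whitespace split stands in for a real tokenizer
masked, labels = mask_tokens(tokens, mask_prob=0.15, seed=0)
print(" ".join(masked))
```

During pretraining, the encoder would be trained to recover the original token at each masked position from the surrounding description text.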

Paper Structure (14 sections, 2 equations, 4 figures, 4 tables)

This paper contains 14 sections, 2 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Overview of GPT-MolBERTa. SMILES strings are sent to ChatGPT, which generates rich textual descriptions consisting of information about functional groups, molecular weight, density, and other properties. These descriptions are then used to pretrain a RoBERTa model. The model is then fine-tuned on MoleculeNet datasets, with the addition of a classification/regression head to the first token embeddings.
  • Figure 2: Effect of Pretraining on GPT-MolBERTa with (a) Classification tasks and (b) Regression tasks. The comparison between the pretrained model and the model trained from scratch is demonstrated for each dataset.
  • Figure 3: A sample attention map from the model. Given a sample description, the map highlights sections of the description according to their attention scores, showing how the model focuses on specific aspects of the description.
  • Figure 4: t-SNE embeddings of the first token of GPT-MolBERTa for (a) ESOL and (b) FreeSolv datasets. Each point in the plot represents log solubility for ESOL and hydration free energy for FreeSolv.
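
The Figure 1 caption mentions a classification/regression head applied to the first token embeddings. A minimal sketch of that idea follows, assuming the head is a single linear layer (the caption does not specify its form). The function name `first_token_head`, the toy dimensions, and all numeric values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def first_token_head(hidden_states, weights, bias):
    """Apply a linear head to the first token's embedding.

    hidden_states: list of per-token vectors, shape (seq_len, hidden_dim)
    weights:       list of rows, shape (num_outputs, hidden_dim)
    bias:          list of length num_outputs
    Returns a list of num_outputs values: one for regression,
    or one logit per class for classification.
    """
    first = hidden_states[0]  # only the first token's embedding is used
    return [
        sum(w * h for w, h in zip(row, first)) + b
        for row, b in zip(weights, bias)
    ]


# Toy encoder output: 2 tokens, hidden size 3 (hypothetical values).
hidden = [
    [1.0, 2.0, 3.0],  # first token embedding, used by the head
    [9.0, 9.0, 9.0],  # later tokens are ignored by the head
]
weights = [[0.5, -1.0, 2.0]]  # a single output, i.e. a regression head
bias = [0.25]

prediction = first_token_head(hidden, weights, bias)
print(prediction)  # [4.75]
```

The same first-token vector is what Figure 4 projects with t-SNE, so the head and the embedding plot both read from a single pooled representation of each description.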