Transformers for molecular property prediction: Lessons learned from the past five years

Afnan Sultan; Jochen Sieg; Miriam Mathea; Andrea Volkamer

TL;DR

This review analyzes transformer-based approaches for molecular property prediction (MPP), covering architectural variants, data sources, tokenization, pre-training objectives, and fine-tuning strategies. It finds that large-scale pre-training on unlabeled molecules is feasible with databases such as ZINC, ChEMBL, and PubChem, but that performance gains depend on data composition and domain-specific objectives rather than on scale alone. The reviewed models report competitive results, but inconsistent benchmarks and evaluation practices hinder fair comparison. The authors advocate standardized data splits, robust statistical analysis, and further exploration of 2D/3D-aware representations and efficient fine-tuning to improve generalization, explainability, and practical impact in MPP.

Abstract

Molecular Property Prediction (MPP) is vital for drug discovery, crop protection, and environmental science. Over the last decades, diverse computational techniques have been developed, from using simple physical and chemical properties and molecular fingerprints in statistical models and classical machine learning to advanced deep learning approaches. In this review, we aim to distill insights from current research on employing transformer models for MPP. We analyze the currently available models and explore key questions that arise when training and fine-tuning a transformer model for MPP. These questions encompass the choice and scale of the pre-training data, optimal architecture selections, and promising pre-training objectives. Our analysis highlights areas not yet covered in current research, inviting further exploration to enhance the field's understanding. Additionally, we address the challenges in comparing different models, emphasizing the need for standardized data splitting and robust statistical analysis.
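To make the call for standardized data splitting concrete, the following is a minimal sketch of a scaffold-style grouped split, in which all molecules sharing a scaffold land on the same side of the split. It is an illustration under stated assumptions, not the procedure of any reviewed model: the function name `grouped_split`, the `train_frac` parameter, and the toy scaffold labels are hypothetical, and the scaffold identifiers are assumed to be precomputed (for example as Bemis-Murcko scaffolds from a cheminformatics toolkit).

```python
from collections import defaultdict


def grouped_split(scaffolds, train_frac=0.8):
    """Split sample indices so that no scaffold group spans train and test.

    `scaffolds` is a list of precomputed scaffold identifiers, one per
    molecule (hypothetical input for this sketch).
    """
    # Collect the indices of all molecules that share a scaffold.
    groups = defaultdict(list)
    for idx, scaffold in enumerate(scaffolds):
        groups[scaffold].append(idx)

    # Largest groups first; ties are broken by first index so the
    # split is deterministic.
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))

    train_cutoff = train_frac * len(scaffolds)
    train, test = [], []
    for group in ordered:
        # A group goes to train only if it fits entirely under the cutoff;
        # otherwise the whole group goes to test.
        if len(train) + len(group) <= train_cutoff:
            train.extend(group)
        else:
            test.extend(group)
    return train, test


# Toy scaffold labels (illustrative only).
scaffolds = [
    "benzene", "benzene", "pyridine", "benzene", "indole",
    "pyridine", "furan", "indole", "benzene", "thiophene",
]
train_idx, test_idx = grouped_split(scaffolds, train_frac=0.8)
```

Because rare scaffolds end up in the test set, a split of this kind tends to be harder than a random split, which is one reason results obtained under different splitting schemes are not directly comparable.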

Paper Structure (23 sections, 1 equation, 9 figures, 16 tables)

Introduction
The transformer model
How transformer models work
Adopted variants of the transformer model for MPP
Molecular transformer models
Data sets used for training molecular transformers
Pre-training data sets
ZINC
ChEMBL
PubChem
Subsets of these major data sources:
Downstream data sets
The current SOTA performance for some downstream data sets
The decisions to consider when implementing a transformer model for MPP
Which database to use for pre-training, and how many molecules should it contain?
...and 8 more sections

Figures (9)

Figure 1: An overview of the transformer model and where each chemical language model fits. The transformer model with the encoder-decoder modules is a sequence-to-sequence model that can be used for tasks like reaction prediction. However, each module can be used independently to provide more specialized performance. For example, the encoder module can be used to predict properties, while the decoder module can be used to generate novel molecules.
Figure 2: A comparison between the ROC-AUC and RMSE ranges for the reviewed articles, some classical machine learning (ML) algorithms, and some deep learning (DL) models. All models used scaffold splitting for the classification data sets and random splitting for the regression data sets. Only models with the same color were tested on the same test sets. The values for the transformer models span some of the models in Table \ref{tab:articles}, which are shown in the side legend. The reported classical ML models are RF and SVM. The DL models span graph-based and different DNN models like D-MPNN, Weave, etc. The values for the classical ML and DL categories are taken from the comparisons reported by the transformer models shown in the side legend. Supplementary Table \ref{tab:models_comparable_comparisons} shows which classical ML and DL models were used for comparison by each transformer model. Data and code used to generate this figure can be found at https://github.com/volkamerlab/Transformers4MPP_review/tree/main.
Figure 3: An overview of the individual components of the transformer model and which decisions were explored by which article. Black text represents the set of questions that were asked in the following subsections.
Figure 4: Comparison between the performance and pre-training data set size for A) MolFormer \cite{ross2022large} and B) ChemBERTa-2 \cite{ahmad2022chemberta}. The models are sorted in ascending order of pre-training data set size. $^*$ MLM = Masked language modeling. MTR = Multi-task regression. Data and code used to generate this figure can be found at https://github.com/volkamerlab/Transformers4MPP_review/tree/main.
Figure 5: Comparison between the performance of the string-based models and models that used different input representations. A) A comparison between the Mol-BERT \cite{li2021mol} model trained on Morgan fingerprints of radius one and SMILES-BERT \cite{wang2019smiles} trained on SMILES. B) A comparison between the MAT \cite{maziarka2020molecule} model trained on a list of atoms and SMILES-Transformer (ST) \cite{honda2019smiles} trained on SMILES. The figure shows the average performance of each model, with error bars showing the standard error (SE) or standard deviation (SD). Data and code used to generate this figure can be found at https://github.com/volkamerlab/Transformers4MPP_review/tree/main.
...and 4 more figures
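The masked language modeling (MLM) objective named in the Figure 4 caption can be illustrated with a short sketch: a fraction of the input tokens is hidden, and the model is trained to recover them. The code below is a simplified illustration, not the implementation of any reviewed model. The regular-expression tokenizer, the function names `tokenize` and `mask_tokens`, the 15% masking probability, and the `[MASK]` token string are all assumptions made for the example; BERT-style recipes often also replace some selected tokens with random tokens or leave them unchanged, which is omitted here.

```python
import random
import re

# Simplified SMILES tokenizer (an assumption for illustration): bracket
# atoms, two-letter halogens, and two-digit ring closures are kept whole,
# and every other character becomes its own token.
TOKEN_PATTERN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")


def tokenize(smiles):
    """Split a SMILES string into a list of tokens."""
    return TOKEN_PATTERN.findall(smiles)


def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Return (masked_tokens, labels) for an MLM-style training example.

    `labels` holds the original token at masked positions and None
    elsewhere, so that a loss would only be computed on masked positions.
    """
    rng = random.Random(seed)  # seeded for reproducibility
    masked, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            masked.append(mask_token)
            labels.append(token)  # target the model should recover
        else:
            masked.append(token)
            labels.append(None)  # position ignored by the loss
    return masked, labels


# Aspirin as an example input.
smiles = "CC(=O)Oc1ccccc1C(=O)O"
tokens = tokenize(smiles)
masked, labels = mask_tokens(tokens)
```

In an actual pre-training pipeline the tokens would be mapped to integer ids and batched before being fed to the encoder; the sketch stops at the masking step, which is what defines the objective.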
