Molecule Attention Transformer

Łukasz Maziarka, Tomasz Danel, Sławomir Mucha, Krzysztof Rataj, Jacek Tabor, Stanisław Jastrzębski

TL;DR

Molecule Attention Transformer (MAT) extends the Transformer encoder by integrating molecular graph structure and inter-atomic distances into the self-attention mechanism, yielding a versatile model for diverse molecular property prediction tasks. MAT demonstrates competitive performance across a wide benchmark and, with simple node-level self-supervised pretraining, achieves state-of-the-art results while drastically reducing hyperparameter tuning needs. The approach provides chemically interpretable attention heads and shows robust transfer when pretrained on large molecular corpora. The work highlights a practical path toward easier-to-use, data-efficient deep learning for drug discovery and material design.

Abstract

Designing a single neural network architecture that performs competitively across a range of molecule property prediction tasks remains largely an open challenge, and its solution may unlock a widespread use of deep learning in the drug discovery industry. To move towards this goal, we propose Molecule Attention Transformer (MAT). Our key innovation is to augment the attention mechanism in Transformer using inter-atomic distances and the molecular graph structure. Experiments show that MAT performs competitively on a diverse set of molecular prediction tasks. Most importantly, with a simple self-supervised pretraining, MAT requires tuning of only a few hyperparameter values to achieve state-of-the-art performance on downstream tasks. Finally, we show that attention weights learned by MAT are interpretable from the chemical point of view.

Paper Structure

This paper contains 53 sections, 3 equations, 8 figures, 19 tables.

Figures (8)

  • Figure 1: Molecule Attention Transformer architecture. We largely base our model on the Transformer encoder. In the first layer we embed each atom using one-hot encoding and atomic features. The main innovation is the Molecule Multi-Head Self-Attention layer that augments attention with distance and graph structure of the molecule. We implement this using a weighted (by $\lambda_d$, $\lambda_g$, and $\lambda_a$) element-wise sum of the corresponding matrices.
  • Figure 2: The average rank across the 7 datasets in the benchmark. For each model we test $500$ (left) or $150$ (right) hyperparameter combinations. We split the data using random or scaffold split (according to the dataset description) 6 times into train/validation/test folds and use the mean metrics across the test folds to obtain the ranklists of models. Interestingly, shallow models (RF and SVM) outperform graph models (GCN, EAGCN and Weave).
  • Figure 3: The average ranks across the 7 datasets in the benchmark. Pretrained MAT outperforms the other methods, despite a drastically smaller number of tested hyperparameters ($7$) compared to MAT and EAGCN ($500$).
  • Figure 4: Test performance of all models as a function of the number of tested hyperparameter combinations (on a logarithmic scale). Figures show the aggregated mean RMSE for regression tasks (left) and the aggregated mean ROC AUC for classification tasks (right). Pretrained MAT requires tuning an order of magnitude fewer hyperparameter combinations, and performs competitively on both sets of tasks.
  • Figure 5: The heatmaps show selected self-attention weights from the first layer of MAT, on a random molecule from the BBBP dataset (center). The atoms that these heads focus on are marked with the same color as the corresponding matrix. The interpretation of the presented patterns is described in the text.
  • ...and 3 more figures
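
The Figure 1 caption describes the Molecule Multi-Head Self-Attention layer as a weighted element-wise sum of three matrices (attention, distance, and graph structure), with weights $\lambda_a$, $\lambda_d$, and $\lambda_g$. The single-head NumPy sketch below illustrates that idea only; it is not the authors' implementation. The function name `molecule_attention`, the default weight values, and the choice to turn distances into weights with a row-wise softmax of the negated distances are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def molecule_attention(Q, K, V, dist, adj,
                       lambda_a=0.5, lambda_d=0.25, lambda_g=0.25):
    """Single-head sketch of attention mixed with distance and graph terms.

    Q, K, V : (n_atoms, d_k) query, key, and value matrices
    dist    : (n_atoms, n_atoms) inter-atomic distance matrix
    adj     : (n_atoms, n_atoms) adjacency matrix of the molecular graph
    The lambda defaults are hypothetical, not values from the paper.
    """
    d_k = Q.shape[-1]
    # Standard scaled dot-product attention weights.
    attn = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)
    # Assumption: closer atoms get larger weights via softmax(-distance).
    dist_term = softmax(-dist, axis=-1)
    # Weighted element-wise sum of the three (n_atoms, n_atoms) matrices.
    mixed = lambda_a * attn + lambda_d * dist_term + lambda_g * adj
    return mixed @ V, mixed


# Toy 3-atom chain (0-1-2) with made-up coordinates and features.
rng = np.random.default_rng(0)
coords = np.array([[0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [3.0, 0.0, 0.0]])
dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
adj = np.array([[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]])
Q, K, V = (rng.normal(size=(3, 4)) for _ in range(3))

out, weights = molecule_attention(Q, K, V, dist, adj)
# Setting lambda_d = lambda_g = 0 recovers plain self-attention.
out_plain, weights_plain = molecule_attention(
    Q, K, V, dist, adj, lambda_a=1.0, lambda_d=0.0, lambda_g=0.0)
```

Keeping the three terms as separate matrices before mixing makes it easy to inspect how much each source of information contributes to a given head, and the `lambda_a=1.0` call shows that the plain Transformer encoder attention is a special case of the mixture.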