BERT Learns (and Teaches) Chemistry

Josh Payne, Mario Srouji, Dian Ang Yap, Vineet Kosaraju

TL;DR

The paper tackles the challenge of learning chemistry from SMILES representations by applying a transformer-based BERT model to discover functional-group–level substructures through attention. It pretrains on large molecular datasets (e.g., ZINC250k) and analyzes the attention heads to identify meaningful chemical motifs. The learned representations are then transferred to downstream tasks (toxicity, solubility, drug-likeness, and synthetic accessibility score, SAS) in two ways: as augmented input features for graph-based models (GCN, GAT) and by fine-tuning BERT itself. The findings show that attention heads can align with chemically active substructures and potential reaction sites. Pretraining benefits the regression tasks, though gains on graph-based predictions are limited. The authors also propose attention visualization as a practical tool for chemists and outline future work to scale up the data and enforce SMILES-equivalence in embeddings to better capture molecular structure.

Abstract

Modern computational organic chemistry is becoming increasingly data-driven. There remain a large number of important unsolved problems in this area such as product prediction given reactants, drug discovery, and metric-optimized molecule synthesis, but efforts to solve these problems using machine learning have also increased in recent years. In this work, we propose the use of attention to study functional groups and other property-impacting molecular substructures from a data-driven perspective, using a transformer-based model (BERT) on datasets of string representations of molecules and analyzing the behavior of its attention heads. We then apply the representations of functional groups and atoms learned by the model to tackle problems of toxicity, solubility, drug-likeness, and synthesis accessibility on smaller datasets using the learned representations as features for graph convolution and attention models on the graph structure of molecules, as well as fine-tuning of BERT. Finally, we propose the use of attention visualization as a helpful tool for chemistry practitioners and students to quickly identify important substructures in various chemical properties.

Paper Structure

This paper contains 19 sections, 5 equations, 5 figures, 1 table.

Figures (5)

  • Figure 1: Visualization of BERT trained on SMILES strings. SMILES sequences are tokenized into functional groups, and the resulting token embeddings, combined with positional embeddings, are fed as input to BERT, the encoder stack of the Transformer. The light blue box denotes the attention mechanisms used in BERT.
  • Figure 2: The attention heads focus more heavily on the carbon chains; a carbon chain's length can help determine solubility.
  • Figure 3: Left: illustration of the chemical reaction in which aspartate reacts with $\alpha$-ketoglutarate to form glutamate. Right: visualization of the attention heads in layer 3 of BERT on different molecules in aspartate.
  • Figure 4: Molecular group discovery for different chemical compounds.
  • Figure 5: Illustration of BERT used for multi-task learning after pretraining. After pretraining, we test on the ZINC250k dataset for three different tasks: LogP, the partition coefficient (left); SAS, the synthetic accessibility score (middle); and QED, the quantitative estimate of drug-likeness (right). Top: training-set MSE; bottom: test-set MSE.
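
The tokenization step mentioned in the Figure 1 caption can be illustrated with a small sketch. The paper's own tokenizer, which groups SMILES characters into functional groups, is not specified in this summary, so the code below substitutes a simpler atom-level regular-expression tokenizer that is commonly used for SMILES strings. The pattern `SMILES_TOKEN` and the function `tokenize_smiles` are hypothetical names introduced here for illustration, and the pattern covers only a common subset of SMILES syntax.

```python
import re

# Assumption: atom-level tokenization, not the functional-group tokenization
# described in the paper. Alternatives are tried left to right, so two-letter
# halogens (Br, Cl) are matched before the single-letter atoms B and C.
SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]"             # bracket atoms, e.g. [NH4+] or [C@@H]
    r"|Br|Cl"                 # two-letter organic-subset atoms
    r"|[BCNOPSFI]"            # one-letter organic-subset atoms
    r"|[bcnops]"              # aromatic atoms
    r"|%\d{2}"                # two-digit ring closures, e.g. %12
    r"|\d"                    # single-digit ring closures
    r"|[=#\-+\\/:~@?>*$().]"  # bonds, branches, and other symbols
)


def tokenize_smiles(smiles):
    """Split a SMILES string into a list of atom-level tokens."""
    tokens = SMILES_TOKEN.findall(smiles)
    # findall silently skips unmatched characters, so verify that the tokens
    # reconstruct the input exactly and fail loudly otherwise.
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognized characters in SMILES: {smiles!r}")
    return tokens


if __name__ == "__main__":
    # Aspirin; each token would map to one input position of the model.
    print(tokenize_smiles("CC(=O)Oc1ccccc1C(=O)O"))
```

In a BERT-style pipeline, each token would then be mapped to an integer id and combined with a positional embedding before entering the encoder. A functional-group tokenizer would merge some of these atom-level tokens into larger units.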