SMILES Transformer: Pre-trained Molecular Fingerprint for Low Data Drug Discovery

Shion Honda, Shoi Shi, Hiroki R. Ueda

TL;DR

This work tackles the challenge of accurate molecular property prediction with limited labeled data by learning data-driven fingerprints from unlabeled SMILES using a Transformer-based pre-training approach. SMILES Transformer (ST) yields 1024-dimensional fingerprints that can be fed into simple predictors, achieving superior data efficiency on several MoleculeNet tasks and providing insights via latent-space visualization. A novel Data Efficiency Metric (DEM) is introduced to quantify performance as training data size changes, highlighting ST’s advantages in small-data regimes. While competitive with state-of-the-art baselines in large-data settings, ST offers meaningful gains for low-data drug discovery and opens avenues for longer-context and multi-task pre-training.

Abstract

In drug-discovery-related tasks such as virtual screening, machine learning is emerging as a promising way to predict molecular properties. Conventionally, molecular fingerprints (numerical representations of molecules) are calculated through rule-based algorithms that map molecules to a sparse discrete space. However, these algorithms perform poorly for shallow prediction models or small datasets. To address this issue, we present SMILES Transformer. Inspired by Transformer and pre-trained language models from natural language processing, SMILES Transformer learns molecular fingerprints through unsupervised pre-training of the sequence-to-sequence language model using a huge corpus of SMILES, a text representation system for molecules. We performed benchmarks on 10 datasets against existing fingerprints and graph-based methods and demonstrated the superiority of the proposed algorithms in small-data settings where pre-training facilitated good generalization. Moreover, we define a novel metric to concurrently measure model accuracy and data efficiency.
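
The metric mentioned above (DEM in the TL;DR) scores a model across several training-set sizes rather than at a single one. The following is a minimal sketch of one way such a score could be computed, assuming it is the plain average of a task metric over a grid of training fractions. The function name data_efficiency_metric, the fraction grid, and the toy_evaluate callback are illustrative assumptions, not the authors' code.

```python
from statistics import mean
from typing import Callable, Sequence

# Assumed grid of training-set fractions; the paper's exact grid may differ.
TRAIN_FRACTIONS = (0.0125, 0.025, 0.05, 0.1, 0.2, 0.4, 0.8)


def data_efficiency_metric(
    evaluate: Callable[[float], float],
    fractions: Sequence[float] = TRAIN_FRACTIONS,
) -> float:
    """Average a task metric over several training-set fractions.

    `evaluate(fraction)` is a hypothetical callback that trains a model on
    that fraction of the labeled data and returns its test score.
    """
    if not fractions:
        raise ValueError("at least one training fraction is required")
    return mean(evaluate(f) for f in fractions)


def toy_evaluate(fraction: float) -> float:
    # Stand-in for "train on `fraction` of the data, score on the test set".
    # The toy score simply grows with the amount of training data.
    return 0.5 + 0.4 * fraction


dem_score = data_efficiency_metric(toy_evaluate)
print(f"DEM-style score: {dem_score:.4f}")
```

In practice the callback would wrap a real train-and-score routine, for example a simple predictor fitted on pre-computed fingerprints. Averaging over sizes rewards models that stay accurate when little labeled data is available.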

Paper Structure

This paper contains 18 sections, 1 equation, 4 figures, 4 tables.

Figures (4)

  • Figure 1: The illustration of SMILES Transformer pre-training and fingerprint extraction.
  • Figure 2: Comparison of model performance across training-set sizes on the 10 datasets. The top row shows the physical chemistry datasets, the second row biophysics, and the bottom two rows physiology. Scores are averaged over 20 trials; error bars show standard deviations.
  • Figure 3: Visualization of the latent space of SMILES Transformer. For three datasets (FreeSolv, BBBP, and ClinTox), the ST fingerprints of the molecules are reduced to two dimensions with t-SNE. The nearest neighbors of 12 data points along a trajectory are then plotted in the latent space (left panel). The 12 points are decoded to molecules and shown in the right panel. The color bar of the top left panel indicates the standardized free energy.
  • Figure 4: ROC-AUC scores for groups stratified by SMILES length (left) and the distribution of SMILES lengths (right) for the BBBP dataset.
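
The kind of analysis summarized in Figure 4 (ROC-AUC per SMILES-length group) can be sketched in a few lines. This is an illustrative sketch only: the bin width, the helper names roc_auc and auc_by_smiles_length, and the toy molecules, labels, and scores are assumptions made for the example and are not taken from the paper or from BBBP.

```python
from collections import defaultdict


def roc_auc(labels, scores):
    """Rank-based ROC-AUC (Mann-Whitney U), counting ties as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None  # AUC is undefined when a group contains only one class
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def auc_by_smiles_length(smiles, labels, scores, bin_width=10):
    """Group molecules into SMILES-length bins and score each bin separately."""
    groups = defaultdict(lambda: ([], []))
    for smi, y, s in zip(smiles, labels, scores):
        start = (len(smi) // bin_width) * bin_width
        groups[start][0].append(y)
        groups[start][1].append(s)
    return {
        (start, start + bin_width - 1): roc_auc(ys, ss)
        for start, (ys, ss) in sorted(groups.items())
    }


# Toy data, made up for illustration: SMILES strings, binary labels,
# and predicted scores from some hypothetical classifier.
toy_smiles = [
    "CCO", "CCN", "c1ccccc1", "CC(=O)O",
    "CC(=O)Oc1ccccc1C(=O)O",
    "CN1C=NC2=C1C(=O)N(C(=O)N2C)C",
    "CC(C)Cc1ccc(cc1)C(C)C(=O)O",
]
toy_labels = [1, 0, 1, 0, 1, 1, 0]
toy_scores = [0.9, 0.2, 0.4, 0.6, 0.7, 0.8, 0.3]

for (lo, hi), auc in auc_by_smiles_length(toy_smiles, toy_labels, toy_scores).items():
    print(f"length {lo}-{hi}: ROC-AUC = {auc}")
```

Bins that contain only one class return None instead of a score, which is worth checking before plotting groups with few molecules.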