SMILES Transformer: Pre-trained Molecular Fingerprint for Low Data Drug Discovery
Shion Honda, Shoi Shi, Hiroki R. Ueda
TL;DR
This work addresses accurate molecular property prediction with limited labeled data by learning data-driven fingerprints from unlabeled SMILES with a Transformer-based pre-training approach. SMILES Transformer (ST) yields 1024-dimensional fingerprints that can be fed into simple predictors, achieving superior data efficiency on several MoleculeNet tasks and providing insights via latent-space visualization. A novel Data Efficiency Metric (DEM) is introduced to quantify performance as training data size changes, highlighting ST's advantages in small-data regimes. While competitive with state-of-the-art baselines in large-data settings, ST offers meaningful gains for low-data drug discovery and opens avenues for longer-context and multi-task pre-training.
Abstract
In drug-discovery-related tasks such as virtual screening, machine learning is emerging as a promising way to predict molecular properties. Conventionally, molecular fingerprints (numerical representations of molecules) are calculated through rule-based algorithms that map molecules to a sparse discrete space. However, these algorithms perform poorly with shallow prediction models or small datasets. To address this issue, we present SMILES Transformer. Inspired by Transformer and pre-trained language models from natural language processing, SMILES Transformer learns molecular fingerprints through unsupervised pre-training of a sequence-to-sequence language model on a huge corpus of SMILES, a text representation system for molecules. We performed benchmarks on 10 datasets against existing fingerprints and graph-based methods and demonstrated the superiority of the proposed algorithm in small-data settings, where pre-training facilitated good generalization. Moreover, we define a novel metric to concurrently measure model accuracy and data efficiency.
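Because SMILES is plain text, a molecule can be turned into a token sequence before it reaches a sequence-to-sequence model. The following is a minimal sketch of that preprocessing step, not the authors' implementation: the regular expression, the special tokens `<pad>` and `<unk>`, and the function names `tokenize_smiles`, `build_vocab`, and `encode` are all assumptions made for illustration.

```python
import re

# Assumed token pattern (hypothetical, not taken from the paper):
# bracket atoms such as [NH4+], the two-letter halogens Br and Cl,
# and otherwise any single character (atoms, bonds, ring digits, parentheses).
TOKEN_PATTERN = re.compile(r"\[[^\]]+\]|Br|Cl|.")


def tokenize_smiles(smiles):
    """Split a SMILES string into a list of string tokens."""
    return TOKEN_PATTERN.findall(smiles)


def build_vocab(corpus):
    """Map every token seen in the corpus to an integer id.

    Ids 0 and 1 are reserved for assumed padding and unknown tokens.
    """
    vocab = {"<pad>": 0, "<unk>": 1}
    for smiles in corpus:
        for token in tokenize_smiles(smiles):
            vocab.setdefault(token, len(vocab))
    return vocab


def encode(smiles, vocab):
    """Convert a SMILES string into a list of token ids."""
    unk = vocab["<unk>"]
    return [vocab.get(token, unk) for token in tokenize_smiles(smiles)]


if __name__ == "__main__":
    corpus = ["CCO", "c1ccccc1", "ClCBr"]  # ethanol, benzene, bromochloromethane
    vocab = build_vocab(corpus)
    for smiles in corpus:
        print(smiles, tokenize_smiles(smiles), encode(smiles, vocab))
```

The id sequences produced by `encode` are the kind of input an embedding layer would consume. How the pre-trained model's hidden states are pooled into a fixed-length fingerprint is outside the scope of this sketch.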
