SMILES Enumeration as Data Augmentation for Neural Network Modeling of Molecules

Esben Jannik Bjerrum

TL;DR

The paper tackles limited labeled data in QSAR and proposes SMILES enumeration as a data augmentation technique for SMILES-based neural networks. Using an LSTM-QSAR model trained on a DHFR inhibitor dataset, the author shows that augmenting the training set with enumerated SMILES improves predictive accuracy. Key results include an increase in test $R^2$ from $0.56$ to $0.66$ and a reduction in test RMS error from $0.62$ to $0.55$, with further gains ($R^2=0.68$, RMS $=0.52$) when averaging predictions over the enumerated SMILES of each molecule. The work demonstrates robustness to non-canonical SMILES representations and provides open-source tooling for SMILES enumeration.
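
To make the idea of enumeration concrete, here is a minimal, toolkit-free sketch that writes several SMILES strings for one molecular graph by starting a depth-first traversal from different atoms and visiting neighbors in different orders. It is a toy under stated assumptions, not the paper's tooling: the function names `write_smiles` and `enumerate_smiles` are hypothetical, the molecule must be connected with fewer than ten rings, and only implicit single or aromatic bonds are handled (no charges, stereochemistry, or explicit bond orders). A cheminformatics toolkit would normally do this job.

```python
import random


def write_smiles(atoms, bonds, order):
    """Write one SMILES string by depth-first traversal (toy writer).

    `order` is a permutation of atom indices: its first entry is the start
    atom, and its ranking decides which neighbor is visited first.
    """
    rank = {atom: pos for pos, atom in enumerate(order)}
    nbrs = {i: [] for i in range(len(atoms))}
    for a, b in bonds:
        nbrs[a].append(b)
        nbrs[b].append(a)
    for lst in nbrs.values():
        lst.sort(key=rank.__getitem__)

    children = {i: [] for i in nbrs}   # tree edges found by the traversal
    closures = {i: [] for i in nbrs}   # ring-closure digits attached to each atom
    ring_bonds = set()
    visited = set()

    def explore(atom, parent):
        visited.add(atom)
        for n in nbrs[atom]:
            if n == parent:
                continue
            if n in visited:
                # Non-tree edge: record it once as a ring closure.
                bond = frozenset((atom, n))
                if bond not in ring_bonds:
                    ring_bonds.add(bond)
                    digit = len(ring_bonds)  # assumes fewer than ten rings
                    closures[atom].append(digit)
                    closures[n].append(digit)
            else:
                children[atom].append(n)
                explore(n, atom)

    def emit(atom):
        out = atoms[atom] + "".join(str(d) for d in closures[atom])
        for kid in children[atom][:-1]:
            out += "(" + emit(kid) + ")"   # every branch but the last is parenthesized
        if children[atom]:
            out += emit(children[atom][-1])
        return out

    explore(order[0], None)
    return emit(order[0])


def enumerate_smiles(atoms, bonds, n_samples=100, seed=0):
    """Collect the distinct strings produced by random atom orderings."""
    rng = random.Random(seed)
    order = list(range(len(atoms)))
    found = set()
    for _ in range(n_samples):
        rng.shuffle(order)
        found.add(write_smiles(atoms, bonds, order))
    return sorted(found)


# Toluene: a methyl carbon (index 0) attached to an aromatic six-membered ring.
TOLUENE_ATOMS = ["C", "c", "c", "c", "c", "c", "c"]
TOLUENE_BONDS = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 1)]
```

With the identity ordering, `write_smiles(TOLUENE_ATOMS, TOLUENE_BONDS, list(range(7)))` returns `Cc1ccccc1`; calling `enumerate_smiles(TOLUENE_ATOMS, TOLUENE_BONDS)` returns that string alongside alternatives that start from other atoms. The set of variants depends on the writer's traversal rules, so the count from this toy need not match the count a particular toolkit produces.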

Abstract

Simplified Molecular Input Line Entry System (SMILES) is a single-line text representation of a unique molecule. One molecule can, however, have multiple SMILES strings, which is why canonical SMILES have been defined to ensure a one-to-one correspondence between SMILES string and molecule. Here, the fact that multiple SMILES represent the same molecule is explored as a technique for data augmentation of a molecular QSAR dataset modeled by a long short-term memory (LSTM) cell based neural network. The augmented dataset was 130 times bigger than the original. The network trained with the augmented dataset shows better performance on a test set when compared to a model built with only one canonical SMILES string per molecule. The correlation coefficient $R^2$ on the test set improved from 0.56 to 0.66 when using SMILES enumeration, and the root mean square error (RMS) likewise fell from 0.62 to 0.55. The technique also works in the prediction phase: taking the per-molecule average of the predictions for the enumerated SMILES gave a further improvement to a correlation coefficient of 0.68 and an RMS of 0.52.
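
The prediction-phase averaging described in the abstract can be sketched as a simple group-by-and-mean step. This is an illustrative sketch only: the function name `average_per_molecule` and the input layout (one molecule identifier per enumerated SMILES, aligned with one model prediction each) are assumptions, not the paper's code.

```python
from collections import defaultdict


def average_per_molecule(mol_ids, predictions):
    """Average the predictions of all enumerated SMILES of each molecule.

    mol_ids[i] identifies the molecule that produced the i-th enumerated
    SMILES; predictions[i] is the model output for that SMILES.
    """
    grouped = defaultdict(list)
    for mol_id, pred in zip(mol_ids, predictions):
        grouped[mol_id].append(pred)
    return {mol_id: sum(vals) / len(vals) for mol_id, vals in grouped.items()}
```

For example, `average_per_molecule(["a", "a", "b"], [1.0, 3.0, 5.0])` returns `{"a": 2.0, "b": 5.0}`: molecule `a` had two enumerated SMILES whose predictions are averaged, while `b` had one.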

Paper Structure

This paper contains 7 sections, 4 figures, 3 tables.

Figures (4)

  • Figure 1: SMILES enumeration enables data augmentation. The molecule toluene corresponds to seven different SMILES; the top one is the canonical SMILES. One data point with toluene in the dataset would thus lead to seven samples in the augmented dataset.
  • Figure 2: Training history for the two datasets and neural networks. A: Neural network trained on canonical SMILES shows a noisy curve where the best model has a test loss of 0.41. B: Neural network trained on enumerated SMILES obtains the best model with a test loss of 0.30. Blue lines are the mean square error without regularization penalty, green is loss including regularization penalty and the red line is mean square error on the test set.
  • Figure 3: Scatter plots of predicted vs. true values. The left column shows scatter plots obtained with the model trained on canonical SMILES only. The right column shows predictions with the model trained on enumerated data. The top row shows predictions for canonical SMILES only, and the bottom row shows predictions for the enumerated dataset. The blue line denotes perfect correlation (y = x).
  • Figure 4: Average of predictions from the enumerated model for each molecule. Train set $R^2$ is 0.88 and RMS is 0.38. Test set $R^2$ is 0.68 and RMS is 0.52.
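
The two metrics quoted in the figure captions can be computed as in the sketch below. One assumption is made loudly here: the abstract calls $R^2$ a "correlation coefficient", so the sketch uses the squared Pearson correlation between true and predicted values. If the paper instead used the coefficient of determination, the numbers would differ whenever predictions are biased. The function names are hypothetical.

```python
import math


def rms_error(y_true, y_pred):
    """Root mean square error between true and predicted values."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def r_squared(y_true, y_pred):
    """Squared Pearson correlation (assumed meaning of R^2; see lead-in)."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov * cov / (var_t * var_p)
```

The two metrics capture different things: for `y_true = [1.0, 2.0, 3.0]` and `y_pred = [2.0, 3.0, 4.0]`, the squared correlation is 1.0 because the ordering and spacing are perfect, while the RMS error is 1.0 because every prediction is offset by one unit.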