Predicting small molecules solubilities on endpoint devices using deep ensemble neural networks

Mayk Caldas Ramos, Andrew D. White

TL;DR

This work presents a deep learning model with predictive uncertainty that runs on a static website (without a server). The model achieves satisfactory results in solubility prediction, and the work demonstrates how to create molecular property prediction models that balance uncertainty and ease of use.

Abstract

Aqueous solubility is a valuable yet challenging property to predict. Computing solubility using first-principles methods requires accounting for the competing effects of entropy and enthalpy, resulting in long computations for relatively poor accuracy. Data-driven approaches, such as deep learning, offer improved accuracy and computational efficiency but typically lack uncertainty quantification. Additionally, ease of use remains a concern for any computational technique, resulting in the sustained popularity of group-based contribution methods. In this work, we addressed these problems with a deep learning model with predictive uncertainty that runs on a static website (without a server). This approach moves computing needs onto the website visitor without requiring installation, removing the need to pay for and maintain servers. Our model achieves satisfactory results in solubility prediction. Furthermore, we demonstrate how to create molecular property prediction models that balance uncertainty and ease of use. The code is available at https://github.com/ur-whitelab/mol.dev, and the model is usable at https://mol.dev.

Paper Structure (11 sections, 4 equations, 3 figures, 2 tables)

Figures (3)

  • Figure 1: Scheme of the deep neural network (DNN). The molecule is input as a SMILES or SELFIES string. This string is tokenized using a vocabulary built from the training dataset. The deep ensemble is a set of models. Each model consists of an embedding layer, two bidirectional RNN (bi-RNN) layers, a normalization layer, and three fully connected layers that are down-sized in three steps. During training, dropout layers follow the embedding layer and each fully connected layer; they are not shown in this scheme. The predictions of the models in the ensemble are then aggregated.
  • Figure 2: Density distributions of the aleatoric (AU) and epistemic (EU) variances for ($i$) kde4$^{LSTM}_{Aug}$ (top six panels) and ($ii$) kde10$^{LSTM}_{Aug}$ (bottom six panels). Increasing the ensemble size shortens the tail of the distribution, decreasing the uncertainty of the predictions. However, the ensemble size does not noticeably affect the center of the distribution.
  • Figure 3: Parity plots for two selected models evaluated on the solubility challenge datasets: ($i$) kde4$^{LSTM}_{Aug}$ (top row) and ($ii$) kde10$^{LSTM}_{Aug}$ (bottom row). The left, middle, and right columns show the parity plots for solubility challenge 1 [Llinas2008-rc], 2-set1, and 2-set2 [Llinas2019-eu], respectively. The Pearson correlation coefficient is displayed together with the RMSE and MAE. "acc-0.5" stands for the $\pm0.5\textrm{log}\%$ metric. Red dashed lines mark the limits within which a prediction counts as correct when computing the $\pm0.5\textrm{log}\%$. The correlation between predicted values and labels increases when more models are added to the ensemble, and the RMSE and MAE improve in the same way. However, the $\pm0.5\textrm{log}\%$ decreases on set 2 of the second solubility challenge (SolChal2-set2). While kde10$^{LSTM}_{Aug}$ improved the predictions for molecules that kde4$^{LSTM}_{Aug}$ predicted poorly, it did not greatly improve the predictions for molecules that already had smaller errors.
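The tokenization step in the Figure 1 caption (a string representation converted to integer tokens through a vocabulary built from the training data) can be illustrated with a minimal sketch. This is not the authors' tokenizer: the regular expression, the `[PAD]` and `[UNK]` special tokens, and the function names `tokenize`, `build_vocab`, and `encode` are all assumptions made for illustration, and the regex covers only a simplified subset of SMILES.

```python
import re

# Simplified SMILES pattern (an assumption, not the paper's tokenizer):
# bracket atoms, two-letter halogens, two-digit ring closures, then any
# remaining single character.
_TOKEN_RE = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")

# Hypothetical special tokens for padding and out-of-vocabulary symbols.
PAD, UNK = "[PAD]", "[UNK]"


def tokenize(smiles):
    """Split a SMILES string into a list of token strings."""
    return _TOKEN_RE.findall(smiles)


def build_vocab(training_smiles):
    """Map every token seen in the training strings to an integer id."""
    tokens = sorted({t for s in training_smiles for t in tokenize(s)})
    vocab = {PAD: 0, UNK: 1}
    for token in tokens:
        vocab[token] = len(vocab)
    return vocab


def encode(smiles, vocab, max_len):
    """Convert a SMILES string to a fixed-length list of token ids."""
    ids = [vocab.get(t, vocab[UNK]) for t in tokenize(smiles)][:max_len]
    return ids + [vocab[PAD]] * (max_len - len(ids))
```

Tokens that never appeared in the training data map to the unknown id, so a model can still accept a new molecule even if its string contains unseen symbols. The fixed length produced by `encode` is what an embedding layer would typically consume.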
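The aggregation step in Figure 1 and the aleatoric and epistemic variances in Figure 2 can be related through the usual deep ensemble decomposition. The sketch below assumes each ensemble member outputs a predicted mean and a predicted variance, which the captions do not state explicitly; the function name `aggregate_ensemble` is hypothetical. Under that assumption, the aleatoric variance is the average of the members' predicted variances, and the epistemic variance is the spread of the members' means around the ensemble mean.

```python
def aggregate_ensemble(means, variances):
    """Combine per-model (mean, variance) predictions for one molecule.

    Returns (ensemble_mean, aleatoric_variance, epistemic_variance).
    Assumes a uniformly weighted mixture of the ensemble members.
    """
    if not means or len(means) != len(variances):
        raise ValueError("means and variances must be non-empty and equal length")
    n = len(means)
    ensemble_mean = sum(means) / n
    # Aleatoric: average noise level that the individual models predict.
    aleatoric = sum(variances) / n
    # Epistemic: disagreement between the models' mean predictions.
    epistemic = sum((m - ensemble_mean) ** 2 for m in means) / n
    return ensemble_mean, aleatoric, epistemic


# Two hypothetical models predicting a log-solubility value.
mean, au, eu = aggregate_ensemble([-2.0, -3.0], [0.5, 1.5])
total_variance = au + eu
```

With these example inputs the ensemble mean is -2.5, the aleatoric variance is 1.0, and the epistemic variance is 0.25. Only the epistemic term depends on disagreement between members, so it is the term one would expect to shrink as members of a larger ensemble converge.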
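The four quantities reported in the Figure 3 parity plots (Pearson correlation, RMSE, MAE, and the $\pm0.5\textrm{log}\%$ accuracy) can be computed as in the following sketch. The function name `parity_metrics` is hypothetical, and treating a prediction exactly 0.5 log units from its label as correct is an assumption; the caption only says that the dashed lines mark the limits.

```python
import math


def parity_metrics(y_true, y_pred, tol=0.5):
    """Return RMSE, MAE, Pearson r, and percent of |error| <= tol."""
    if len(y_true) != len(y_pred) or len(y_true) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mae = sum(abs(e) for e in errors) / n
    # Percentage of predictions inside the +/- tol band around the label.
    acc = 100.0 * sum(1 for e in errors if abs(e) <= tol) / n
    # Pearson correlation from centered sums (undefined for constant input).
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    pearson = cov / math.sqrt(var_t * var_p)
    return {"rmse": rmse, "mae": mae, "pearson": pearson, "acc": acc}


# Hypothetical log-solubility labels and predictions.
metrics = parity_metrics([-1.0, -2.0, -3.0, -4.0], [-1.5, -2.0, -2.0, -4.0])
```

In this example one of the four predictions has an error of 1.0 log unit, so the accuracy is 75% while the MAE is 0.375. The example also shows why the metrics in Figure 3 can move in different directions: reducing a large error lowers the RMSE and MAE but changes the accuracy only if the prediction crosses into the $\pm0.5$ band.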