Generating Focussed Molecule Libraries for Drug Discovery with Recurrent Neural Networks

Marwin H. S. Segler, Thierry Kogej, Christian Tyrchan, Mark P. Waller

TL;DR

This study demonstrates that SMILES-based recurrent neural networks can learn the grammar of drug-like molecules from large catalogs and generate extensive, drug-like libraries. By applying transfer learning to small sets of actives and coupling generation with a target-prediction scorer, the approach produces focused libraries enriched for activity against targets such as 5-HT2A, Plasmodium falciparum, and Staphylococcus aureus. The work shows substantial enrichment (EOR values up to ~66.9) and the feasibility of iterating design-synthesis-test cycles entirely in silico, even without initial actives. Overall, the method offers a simple, data-driven path to de novo drug design that complements docking and synthesis planning, with clear potential for rapid exploration of chemical space and scaffold diversification.

Abstract

In de novo drug design, computational strategies are used to generate novel molecules with good affinity to the desired biological target. In this work, we show that recurrent neural networks can be trained as generative models for molecular structures, similar to statistical language models in natural language processing. We demonstrate that the properties of the generated molecules correlate very well with the properties of the molecules used to train the model. In order to enrich libraries with molecules active towards a given biological target, we propose to fine-tune the model with small sets of molecules, which are known to be active against that target. Against Staphylococcus aureus, the model reproduced 14% of 6051 hold-out test molecules that medicinal chemists designed, whereas against Plasmodium falciparum (Malaria) it reproduced 28% of 1240 test molecules. When coupled with a scoring function, our model can perform the complete de novo drug design cycle to generate large sets of novel molecules for drug discovery.

Paper Structure

This paper contains 19 sections, 4 equations, 17 figures, 3 tables.

Figures (17)

  • Figure 1: Examples of molecules and their SMILES representation. To create SMILES correctly, the model has to learn long-term dependencies, for example to close rings (indicated by numbers) and brackets.
  • Figure 2: a) Recursively defined RNN b) The same RNN, unrolled. The parameters $\theta$ (the weight matrices of the neural network) are shared over all time steps.
  • Figure 3: The Symbol Generation and Sampling Process. We start with a random seed symbol $\mathbf{s}_1$, here c, which is converted into a one-hot vector $\mathbf{x}_1$ and fed into the model. The model then updates its internal state from $\mathbf{h}_0$ to $\mathbf{h}_1$ and outputs $\mathbf{y}_1$, the probability distribution over the next symbol. Here, sampling yields $\mathbf{s}_2 =$ 1. Converting $\mathbf{s}_2$ to $\mathbf{x}_2$ and feeding it to the model leads to the updated hidden state $\mathbf{h}_2$ and the output $\mathbf{y}_2$, from which we can sample again. This iterative symbol-by-symbol procedure can be continued as long as desired. In this example, we stop it after observing an EOL (\n) symbol and obtain the SMILES for benzene. The hidden state $\mathbf{h}_i$ allows the model to keep track of opened brackets and rings, to ensure that they will be closed again later.
  • Figure 4: A few randomly selected, generated molecules. Ad = Adamantyl
  • Figure 5: t-SNE projection of 7 physicochemical descriptors of random molecules from ChEMBL (blue) and molecules generated with the neural network trained on ChEMBL (green), to two unitless dimensions. The distributions of both sets overlap significantly.
  • ...and 12 more figures
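
The sampling loop described in the Figure 3 caption can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: `toy_step` is a hypothetical, hand-written stand-in for a trained RNN step, the three-symbol vocabulary is made up for the example, and the "hidden state" is a simple tuple that tracks how many aromatic atoms were emitted and whether ring 1 is still open. A trained model would return soft probabilities learned from data; here every distribution puts all its mass on one symbol so that the walk-through reproduces the benzene example.

```python
import random

# Toy symbol set (an assumption); a real model covers the full SMILES alphabet.
VOCAB = ["c", "1", "\n"]


def one_hot(symbol):
    """Encode a symbol as a one-hot vector over VOCAB."""
    return [1.0 if s == symbol else 0.0 for s in VOCAB]


def toy_step(h, x):
    """Hypothetical stand-in for one RNN step: returns (new_state, probs).

    h is (n_atoms, ring_open). A real RNN would use a learned real-valued
    hidden vector instead of this hand-coded bookkeeping.
    """
    n_atoms, ring_open = h
    symbol = VOCAB[x.index(1.0)]
    if symbol == "c":
        n_atoms += 1
    elif symbol == "1":
        ring_open = not ring_open  # the same digit opens and closes the ring

    if ring_open:
        # Inside the ring: add atoms until there are six, then close the ring.
        nxt = "c" if n_atoms < 6 else "1"
    else:
        # Ring not opened yet: open it. Ring already closed: end the line.
        nxt = "1" if n_atoms < 6 else "\n"
    return (n_atoms, ring_open), one_hot(nxt)


def sample_smiles(seed="c", max_len=50, rng=None):
    """Sample symbol by symbol until EOL or max_len symbols are emitted."""
    rng = rng or random.Random()
    h = (0, False)  # initial hidden state h_0
    symbol = seed   # seed symbol s_1
    out = []
    for _ in range(max_len):
        if symbol == "\n":
            break
        out.append(symbol)
        h, probs = toy_step(h, one_hot(symbol))            # h_{t-1} -> h_t, y_t
        symbol = rng.choices(VOCAB, weights=probs, k=1)[0]  # sample s_{t+1}
    return "".join(out)


print(sample_smiles())
```

With this deterministic toy step, the loop emits c, 1, five more c symbols, a closing 1, and then the EOL symbol, which stops generation. The point of the sketch is the control flow: encode the last symbol, update the state, sample from the output distribution, repeat.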