RGFN: Synthesizable Molecular Generation Using GFlowNets

Michał Koziarski, Andrei Rekesh, Dmytro Shevchuk, Almer van der Sloot, Piotr Gaiński, Yoshua Bengio, Cheng-Hao Liu, Mike Tyers, Robert A. Batey

TL;DR

This paper proposes Reaction-GFlowNet (RGFN), an extension of the GFlowNet framework that operates directly in the space of chemical reactions, thereby allowing out-of-the-box synthesizability while maintaining comparable quality of generated candidates.

Abstract

Generative models hold great promise for small molecule discovery, significantly increasing the size of the search space compared to traditional in silico screening libraries. However, most existing machine learning methods for small molecule generation suffer from poor synthesizability of candidate compounds, making experimental validation difficult. In this paper, we propose Reaction-GFlowNet (RGFN), an extension of the GFlowNet framework that operates directly in the space of chemical reactions, thereby allowing out-of-the-box synthesizability while maintaining comparable quality of generated candidates. We demonstrate that, with the proposed set of reactions and building blocks, it is possible to obtain a search space of molecules orders of magnitude larger than existing screening libraries while keeping the cost of synthesis low. We also show that the approach scales to very large fragment libraries, further increasing the number of potential molecules. We demonstrate the effectiveness of the proposed approach across a range of oracle models, including pretrained proxy models and GPU-accelerated docking.

Paper Structure

This paper contains 32 sections, 12 equations, 32 figures, 4 tables.

Figures (32)

  • Figure 1: Illustration of the RGFN sampling process. First, RGFN selects an initial molecular building block. In the next two steps, a reaction and a suitable reactant are chosen. The reaction is then simulated in silico with RDKit's RunReactants functionality, and one of the resulting molecules is selected. The process is repeated until the stop action is chosen. The resulting molecule is then evaluated using the reward function.
  • Figure 2: Estimation of the state space size of RGFN as a function of the maximum number of allowed reactions. RGFN (350) indicates a variant using 350 hand-picked inexpensive building blocks, while RGFN (8350) also uses 8,000 randomly selected Enamine building blocks. Enamine REAL (6.5B compounds) is shown as a reference.
  • Figure 3: Distributions of rewards across different tasks.
  • Figure 4: Number of discovered modes as a function of normalized iterations. Log scale used.
  • Figure 5: The number of discovered Murcko scaffolds with sEH proxy value above 7 (a) and 8 (b) as a function of fragment library size. We compare standard independent embeddings of fragment selection actions (blue) with our fingerprint-based embeddings (orange) that account for the fragments' chemical structure. The number of scaffolds is reported after 2k training iterations for 3 random seeds (the solid line is the median, while the shaded area spans from minimum to maximum values). We observe that our approach greatly outperforms independent embedding when scaling to a larger action space.
  • ...and 27 more figures
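
The sampling loop described in the Figure 1 caption can be sketched roughly as follows. This is a toy illustration only, not the authors' implementation: molecules are plain strings, the entries of `REACTIONS` are hypothetical stand-ins for RDKit's RunReactants (which can likewise return several candidate products), a uniform random choice replaces the learned GFlowNet policy, and `toy_reward` is an invented placeholder for an oracle. The `max_reactions` cap mirrors the maximum number of allowed reactions mentioned in the Figure 2 caption.

```python
import random

# Hypothetical toy building blocks; a real setup would use actual molecules.
BUILDING_BLOCKS = ["A", "B", "C"]

# Toy "reactions" (assumption): each maps (molecule, reactant) to a list of
# possible products, standing in for an in silico reaction simulation.
REACTIONS = {
    "join": lambda mol, r: [f"{mol}-{r}"],
    "join_either_side": lambda mol, r: [f"{mol}-{r}", f"{r}-{mol}"],
}

STOP = "stop"


def toy_reward(molecule: str) -> float:
    """Invented placeholder reward: fraction of fragments equal to 'A'."""
    parts = molecule.split("-")
    return parts.count("A") / len(parts)


def sample_trajectory(rng: random.Random, max_reactions: int = 4):
    """Sample one molecule with a uniform random policy (not a trained model)."""
    # Step 1: select an initial building block.
    molecule = rng.choice(BUILDING_BLOCKS)
    steps = [("block", molecule)]
    for _ in range(max_reactions):
        # Step 2: choose a reaction, or the stop action.
        action = rng.choice(list(REACTIONS) + [STOP])
        if action == STOP:
            break
        # Step 3: choose a reactant for that reaction.
        reactant = rng.choice(BUILDING_BLOCKS)
        # Step 4: simulate the reaction and select one resulting product.
        products = REACTIONS[action](molecule, reactant)
        molecule = rng.choice(products)
        steps.append((action, reactant, molecule))
    # Step 5: evaluate the final molecule with the reward function.
    return molecule, steps, toy_reward(molecule)


if __name__ == "__main__":
    final, trajectory, reward = sample_trajectory(random.Random(0))
    print(final, reward)
    for step in trajectory:
        print(step)
```

Because every step applies one of a fixed set of reactions to known building blocks, each sampled string comes with its own construction recipe, which is the property the abstract refers to as out-of-the-box synthesizability.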