
Generative Artificial Intelligence for Navigating Synthesizable Chemical Space

Wenhao Gao, Shitong Luo, Connor W. Coley

TL;DR

This work introduces SynFormer, a generative modeling framework designed to efficiently explore and navigate synthesizable chemical space, and shows that the approach scales: performance improves as more computational resources become available.

Abstract

We introduce SynFormer, a generative modeling framework designed to efficiently explore and navigate synthesizable chemical space. Unlike traditional molecular generation approaches, we generate synthetic pathways for molecules to ensure that designs are synthetically tractable. By incorporating a scalable transformer architecture and a diffusion module for building block selection, SynFormer surpasses existing models in synthesizable molecular design. We demonstrate SynFormer's effectiveness in two key applications: (1) local chemical space exploration, where the model generates synthesizable analogs of a reference molecule, and (2) global chemical space exploration, where the model aims to identify optimal molecules according to a black-box property prediction oracle. Additionally, we demonstrate the scalability of our approach via the improvement in performance as more computational resources become available. With our code and trained models openly available, we hope that SynFormer will find use across applications in drug discovery and materials science.

Paper Structure (31 sections, 12 equations, 21 figures, 1 table)


Figures (21)

  • Figure 1: Schematic illustration of the SynFormer framework and architecture. (A) The SynFormer-ED architecture is an encoder-decoder that takes a molecule as input and outputs a synthetic route to the same or an analogous molecule. (B) SynFormer-D is a decoder-only framework designed to generate synthetic routes. (C) Synthetic routes are tokenized using a postfix notation to make them amenable to autoregressive generation. Routes are constructed by applying 115 reactions to a set of 223,244 molecular building blocks, covering a synthesizable chemical space estimated as $>10^{60}$ molecules. (D) During generation, a token generated by the transformer is first classified by token type. A reaction token then undergoes an additional classification step to select the appropriate reaction, whereas a reactant (building block) token undergoes a denoising diffusion process followed by a nearest neighbor search to select the appropriate building block(s).
  • Figure 2: Model performance on molecular reconstruction. (A and B) Comparison of the reconstruction rate and average structural (Tanimoto) similarity between input and output molecules for SynFormer-ED, ChemProjector [luo2024projecting], and SynNet [gao2021amortized] on 1,000 randomly selected molecules from (A) the REAL Diversity Set and (B) the ChEMBL database. (C) Scaling of model performance is measured by the training loss (binary cross-entropy (BCE) of the molecular fingerprint (FP) prediction) as model size and training data size increase.
  • Figure 3: Application of SynFormer in projecting unsynthesizable designs into synthesizable chemical space. (A) Schematic illustration of SynFormer-ED generating synthesizable structural analogs for unsynthesizable molecules. (B) Normalized distribution of SA scores for the originally designed molecules and their corresponding synthesizable analogs. Note that the distributions are normalized to a peak value of 1 for a clearer comparison. (C) Scatter plot comparing objective scores of originally designed molecules versus their generated analogs. Points are colored by the structural similarity between them, showing that structurally similar analogs tend to have similar properties. (D) Examples of originally designed unsynthesizable molecules and their SynFormer-generated synthesizable analogs. Objective scores are shown beneath each molecule, demonstrating comparable activities for the generated analogs. The modified fragments or atoms are highlighted in light blue. (E) Workflow in which SynFormer-ED generates synthesizable analogs for ligands produced by structure-based drug design. (F) Scatter plot comparing the Vina docking scores of originally designed ligands and their generated analogs. Points are colored by the structural (Tanimoto) similarity between the input and output, showing strong agreement in general and an ability to generate analogs with comparable scores. (G) The original design and generated analog for estrogen receptor alpha, with their SA scores and Vina scores below. The modified fragments or atoms are highlighted in light blue.
  • Figure 4: Application of SynFormer in hit expansion. (A) Schematic illustrating SynFormer-ED expanding a known hit compound into structurally similar, synthesizable analogs. (B) Normalized distribution of predicted JNK3 inhibition scores for molecules from screening ZINC250k, the hits' nearest neighbors in Enamine REAL, and SynFormer-generated analogs of the hits, highlighting the enrichment of high-scoring ligands among SynFormer analogs. Note that the distributions are normalized to a peak value of 1 for a clearer comparison. (C) Examples of hits from screening ZINC250k and the corresponding best-scoring generated analogs, alongside the best-scoring molecules identified by the nearest neighbor search within Enamine REAL, demonstrating SynFormer-ED’s ability to generate high-scoring, synthesizable compounds. The motifs retained from the hits are colored in red. (D and E) Expanding experimentally validated ligands for the PKM2 target (PDB: 3ME3) (D) and the KAT2A target (PDB: 5MLJ) (E). Three representative analogs are shown for each target, along with their synthetic pathways. The structural motifs retained from the hits are colored in red.
  • Figure 5: Application of SynFormer in global chemical space exploration. (A) Illustration of fine-tuning SynFormer-D with reinforcement learning. (B) Performance comparison of SynFormer fine-tuned with reinforcement learning (SF-RL) against other popular methods, showing the average top-10 molecule scores versus the number of oracle calls [gao2022sample]. The plots represent the mean performance curves of 5 independent runs, with the shaded region for SF-RL indicating the range across these runs. (C) Illustration of a genetic algorithm with SynFormer-ED used for mutation steps. (D) AUC Top-10 performance comparison across different molecular design methods (GraphGA-SF, GraphGA [jensen2019graph], AugMem [guo2024augmented], SynNet [gao2021amortized], DoG-Gen [bradshaw2020barking]) for four tasks from GuacaMol [brown2019guacamol]. (E) Distribution of SA scores [ertl2009estimation] for the top 25 molecules at various optimization steps, with colors representing objective scores, demonstrating how SynFormer effectively constrains its design space to synthesizable molecules only. (F) The best molecules generated by GraphGA (top) and GraphGA-SF (bottom) show that GraphGA-SF identifies a more synthetically tractable candidate with a reduced SA score, albeit with a minor sacrifice in the objective score.
  • ...and 16 more figures
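
The caption of Figure 1(C) says synthetic routes are tokenized in postfix notation. A minimal sketch of how such a token sequence could be turned back into a route with a stack is given below. The `Token` class, the `arity` field, and the placeholder names `A`, `B`, `C`, `R1`, `R2` are all hypothetical and are not taken from the SynFormer code; real reactions would be applied with a cheminformatics toolkit, whereas here a product is only a nested string.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    kind: str       # "BB" for a building block, "RXN" for a reaction (hypothetical labels)
    value: str      # placeholder identifier, not a real molecule or reaction template
    arity: int = 0  # number of reactants a reaction token consumes


def evaluate_postfix(tokens):
    """Rebuild a route from postfix tokens using a stack.

    Building blocks are pushed; a reaction pops its reactants and
    pushes a symbolic product string in their place.
    """
    stack = []
    for tok in tokens:
        if tok.kind == "BB":
            stack.append(tok.value)
        elif tok.kind == "RXN":
            if len(stack) < tok.arity:
                raise ValueError(f"reaction {tok.value} needs {tok.arity} reactants")
            # Pop reactants, then restore their original left-to-right order.
            reactants = [stack.pop() for _ in range(tok.arity)][::-1]
            stack.append(f"{tok.value}({', '.join(reactants)})")
        else:
            raise ValueError(f"unknown token kind: {tok.kind}")
    if len(stack) != 1:
        raise ValueError("a valid route must reduce to exactly one product")
    return stack[0]


# Two-step route: combine A and B with R1, then combine that product with C using R2.
route = [
    Token("BB", "A"),
    Token("BB", "B"),
    Token("RXN", "R1", arity=2),
    Token("BB", "C"),
    Token("RXN", "R2", arity=2),
]
result = evaluate_postfix(route)
print(result)
```

Because every reactant precedes the reaction that consumes it, a postfix sequence can be checked and executed token by token, which is one reason this ordering suits left-to-right autoregressive generation.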
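
Figures 1(D) and 2 refer to a nearest neighbor search over building blocks and to structural (Tanimoto) similarity. The sketch below shows the Tanimoto coefficient on fingerprints stored as sets of "on" bit indices, plus a brute-force nearest neighbor lookup. It is an illustration only: the fingerprint contents and the names `bb_1`, `bb_2`, `bb_3` are invented, and using Tanimoto similarity as the search metric is an assumption here, since the captions do not state which distance SynFormer uses in its building block search.

```python
def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto coefficient |a & b| / |a | b| for sets of on-bit indices."""
    union = a | b
    if not union:
        return 0.0  # convention chosen for this sketch when both fingerprints are empty
    return len(a & b) / len(union)


def nearest_building_block(query: frozenset, library: dict) -> str:
    """Return the key of the library fingerprint most similar to the query (brute force)."""
    return max(library, key=lambda name: tanimoto(query, library[name]))


# Toy fingerprints (hypothetical bit indices, not derived from real molecules).
library = {
    "bb_1": frozenset({1, 2, 3}),
    "bb_2": frozenset({2, 3, 4, 5}),
    "bb_3": frozenset({7, 8}),
}
query = frozenset({2, 3, 4})

best = nearest_building_block(query, library)
print(best, tanimoto(query, library[best]))
```

For a library of the size quoted in Figure 1 (223,244 building blocks), a brute-force scan like this is still feasible, although an indexed or vectorized search would normally be preferred in practice.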