SynLlama: Generating Synthesizable Molecules and Their Analogs with Large Language Models

Kunyang Sun, Dorian Bagni, Joseph M. Cavanagh, Yingze Wang, Jacob M. Sawyer, Bo Zhou, Andrew Gritsevskiy, Oufan Zhang, Teresa Head-Gordon

TL;DR

SynLlama addresses the synthetic feasibility gap in de novo molecule generation by fine-tuning a relatively small LLM (Llama3) on reaction data to predict retrosynthetic routes and align its outputs with a large space of reactions and purchasable building blocks. It defines a practical synthesis space using Enamine building blocks (BBs) and two reaction (RXN) template sets, and introduces a reconstruction algorithm that maps LLM outputs to valid synthesis pathways or synthesizable analogs. On unseen drug-like molecules and docking-driven analog generation tasks, SynLlama achieves competitive synthesis planning, improves the synthetic accessibility of generated analogs, and enables local hit expansion, with free energy perturbation (FEP) binding free energies that improve on the original hits or fall within the uncertainty range. The work demonstrates that data-efficient fine-tuning of an LLM can bridge computational design with experimental synthetic chemistry, yielding actionable, purchasable candidates for medicinal chemistry pipelines.

Abstract

Generative machine learning models for exploring chemical space have shown immense promise, but many molecules they generate are too difficult to synthesize, making them impractical for further investigation or development. In this work, we present a novel approach by fine-tuning Meta's Llama3 Large Language Models (LLMs) to create SynLlama, which generates full synthetic pathways made of commonly accessible building blocks and robust organic reaction templates. SynLlama explores a large synthesizable space using significantly less data, and offers strong performance in both forward and bottom-up synthesis planning compared to other state-of-the-art methods. We find that SynLlama, even without training on external building blocks, can effectively generalize to unseen yet purchasable building blocks, meaning that its reconstruction capabilities extend to a broader synthesizable chemical space than the training data. We also demonstrate the use of SynLlama in a pharmaceutical context for synthesis planning of analog molecules and hit expansion leads for proposed inhibitors of target proteins, offering medicinal chemists a valuable tool for discovery.

Paper Structure

This paper contains 15 sections, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Overview of the SynLlama workflow, including data generation, supervised fine-tuning, inference, and reconstruction. (a) The predefined synthesizable chemical space of reaction templates (RXN) and building blocks (BBs), which covers billions of molecules. (b) An example synthesis and the process for generating it from the defined synthesizable chemical space to create training examples. Here, RXN 76 represents amide coupling and RXN 72 represents Suzuki coupling. (c) A schematic representation of the supervised fine-tuning that converts Llama 3 models into SynLlama models, along with the instruction, input, and output for the example synthesis in (b). (d) SynLlama's inference on an unseen test molecule. Black represents SynLlama's raw retrosynthetic output, consisting of RXN sequences and predicted BBs, while colored BBs indicate the two BBs in the Enamine building block library most similar to each predicted one. (e) Molecules reconstructed using the predicted reaction sequences and similar building blocks from the Enamine building block library. In this example, all predicted building blocks are present in the Enamine library, allowing complete reconstruction of the input molecule and the generation of close analogs.
  • Figure 2: SynLlama performance on generating synthesizable analogs for Pocket2Mol and iMiner proposed binders of SARS2 MPro [zhang2021potent], Thrombin [thrombin], and TYK2 [tyk2, tyk2_2]. Correlation plot comparing docking scores of (a) Pocket2Mol and (b) iMiner generated molecules and the average Vina docking scores of the ten most similar analogs from SynLlama trained with RXN 2. Each data point is color-coded by the average Morgan fingerprint similarity computed between the generated and analog molecules. The shaded area represents an energy uncertainty range of $\pm 2$ kcal/mol for docking [Trott2010]. (c) Synthetic accessibility (SA) score distribution of Pocket2Mol, iMiner, and unsynthesizable molecules and SynLlama-proposed analogs. iMiner analogs generated with SynLlama trained on RXN 1 showed similar results, as reported in Supplementary Figure S6. The kernel density in Supplementary Figure S7 further confirms our finding that the analogs consistently shift toward better SA without undermining the overall docking score distribution. (d) Average Morgan fingerprint similarity score between the target molecules and their top-10 proposed analogs.
  • Figure 3: Examples of synthesizable analog generation for SARS2 MPro using iMiner and TYK2 and Thrombin with Pocket2Mol. (a) Docked pose visualization for all three protein targets. (b) Docking and SA scores for iMiner target and SynLlama analog for SARS2 MPro along with the predicted synthetic pathway. (c) Docking and SA scores for the Pocket2Mol targets and the SynLlama analogs for TYK2 and Thrombin along with their predicted synthetic pathways.
  • Figure 4: Hit expansion of binders to SARS2 MPro, Thrombin, and TYK2 with SynLlama. (a,c,e) SynLlama-predicted synthetic pathways that expand on the hit molecules for each protein target. The places of substitution are labeled as R groups. (b,d,f) Binding free energies of the hit compounds and SynLlama-expanded analogs. The color scheme for the proposed substitutions matches that of the predicted synthetic pathways. All potential binders either have a better FEP binding free energy than the original hits or fall within the 1 kcal/mol uncertainty range.
  • Figure :
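The captions for Figures 1(d) and 2(d) refer to retrieving the most similar building blocks from a library and to averaging fingerprint similarity over the top analogs. The following is a minimal sketch of that kind of lookup, not the authors' implementation. It assumes a toy fingerprint made of character bigrams of the SMILES string as a stand-in for Morgan fingerprints, and the names `toy_fingerprint`, `top_k_similar`, and `mean_top_k_similarity`, as well as the four-entry library, are hypothetical.

```python
from typing import FrozenSet, List, Tuple


def toy_fingerprint(smiles: str, n: int = 2) -> FrozenSet[str]:
    """Character n-grams of a SMILES string.

    Assumption: this is only a stand-in for a Morgan fingerprint,
    chosen so the sketch runs without a cheminformatics toolkit.
    """
    if len(smiles) < n:
        return frozenset({smiles})
    return frozenset(smiles[i:i + n] for i in range(len(smiles) - n + 1))


def tanimoto(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Tanimoto (Jaccard) similarity: |a & b| / |a | b|."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def top_k_similar(predicted: str, library: List[str], k: int = 2) -> List[Tuple[str, float]]:
    """Return the k library entries most similar to the predicted building block."""
    query = toy_fingerprint(predicted)
    scored = [(bb, tanimoto(query, toy_fingerprint(bb))) for bb in library]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


def mean_top_k_similarity(predicted: str, library: List[str], k: int = 10) -> float:
    """Average similarity between a query and its top-k matches."""
    hits = top_k_similar(predicted, library, k)
    return sum(score for _, score in hits) / len(hits) if hits else 0.0


# Hypothetical four-entry library; a real purchasable library is far larger.
library = ["CC(=O)O", "OC(=O)c1ccccc1", "NCc1ccccc1", "OB(O)c1ccccc1"]
hits = top_k_similar("OC(=O)c1ccccc1", library, k=2)
```

When the predicted building block is itself in the library, the top hit is an exact match with similarity 1.0, which corresponds to the case described in Figure 1(e) where the input molecule can be fully reconstructed. Lower-ranked hits would instead lead to analogs.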