Leap: molecular synthesisability scoring with intermediates

Antonia Calvi, Théophile Gaudin, Dominik Miketa, Dominique Sydow, Liam Wilbraham

TL;DR

Leap is a GPT-2–based synthesisability scorer that can condition on available intermediates to better estimate the practical difficulty of synthesising a target molecule. The model is first pre-trained on multi-step retrosynthesis routes encoded as tree-like strings, then fine-tuned to predict synthesis-tree depth with and without intermediates; the routes are generated with AiZynthFinder. Across public and project molecules, Leap outperforms existing scorers (SAScore, SCScore, RAScore) by at least 5% in AUC when identifying synthesisable compounds, and it adapts its scores when relevant intermediates are provided. The method enables fast, dynamic assessment of synthetic tractability within generative workflows and remains robust in out-of-domain settings.

Abstract

Assessing whether a molecule can be synthesised is a primary task in drug discovery. It enables computational chemists to filter for viable compounds or bias molecular generative models. Synthesisability is a dynamic notion: it evolves with the availability of key compounds. A common approach in drug discovery involves exploring the chemical space surrounding synthetically accessible intermediates. This strategy improves the synthesisability of the derived molecules because the key intermediates are already available. Existing synthesisability scoring methods, such as SAScore, SCScore and RAScore, cannot condition on intermediates dynamically. Our approach, Leap, is a GPT-2 model trained on the depth, or longest linear path, of predicted synthesis routes, and it allows information on the availability of key intermediates to be included at inference time. We show that Leap surpasses all other scoring methods by at least 5% in AUC when identifying synthesisable molecules, and can successfully adapt predicted scores when presented with a relevant intermediate compound.
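To make the "depth, or longest linear path" target concrete, the following is a minimal sketch of how such a depth could be computed from a route tree, and how it drops when an intermediate is treated as available. The `Node` class, the `route_depth` function and the placeholder labels are all hypothetical illustrations; they are not Leap's string encoding, and the abstract does not describe how the labels are actually recomputed when an intermediate is supplied.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    """A molecule in a hypothetical route tree; children are its precursors."""
    label: str
    children: list[Node] = field(default_factory=list)


def route_depth(node: Node, available: frozenset[str] = frozenset()) -> int:
    """Count reaction steps along the longest linear path below `node`.

    Molecules named in `available` are treated as leaves (assumption:
    an available intermediate needs no further synthesis).
    """
    if node.label in available or not node.children:
        return 0
    return 1 + max(route_depth(child, available) for child in node.children)


# Made-up route: target <- A + B, A <- C + D, C <- E + F
route = Node("target", [
    Node("A", [
        Node("C", [Node("E"), Node("F")]),
        Node("D"),
    ]),
    Node("B"),
])

depth_without = route_depth(route)                    # 3 steps
depth_with_a = route_depth(route, frozenset({"A"}))   # 1 step
depth_with_c = route_depth(route, frozenset({"C"}))   # 2 steps
```

In this toy case, supplying intermediate A shortens the longest path from three steps to one, which is the kind of shift a scorer that conditions on intermediates is meant to capture.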

Paper Structure (17 sections, 5 figures, 1 table)

Figures (5)

  • Figure 1: Schematic diagram of our workflow. a) An example of a synthesis tree of depth 3 and its representation as a string; b) the input and expected output of our model during pre-training; c) the input and expected output of our model fine-tuned to predict tree depth for a molecule with and without an intermediate.
  • Figure 2: ROC-AUC curves for all scorers on both test and project molecules.
  • Figure 3: Distribution of scores assigned by Leap, SAScore, SCScore and RAScore for molecules with synthetic routes of different depths.
  • Figure 5: Bar plot showing the AUC score of each scorer with and without a key intermediate supplied for the molecules.
  • Figure 6: Distributions of predicted depths for both project and test molecules when false, true and no intermediates are supplied.
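
Figures 2 and 5 compare scorers by AUC. As a rough illustration of that metric, the sketch below computes AUC as the fraction of (synthesisable, non-synthesisable) pairs that a scorer ranks correctly. The `roc_auc` helper, the binary labels and the depth values are made up for this example. It also assumes that a lower predicted depth means a more synthesisable molecule, so depths are negated before ranking; how the paper defines the synthesisable class is not stated in this excerpt.

```python
def roc_auc(labels: list[int], scores: list[float]) -> float:
    """AUC as the probability that a random positive outscores a random negative.

    Ties count as half a win (the Mann-Whitney U formulation).
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up data: 1 = synthesisable, 0 = not synthesisable
labels = [1, 1, 0, 0, 1]
predicted_depths = [1.0, 2.0, 5.0, 2.0, 3.0]

# Lower depth should rank higher, so negate before scoring (assumption)
auc = roc_auc(labels, [-d for d in predicted_depths])  # 0.75
```

The pairwise loop is quadratic in the number of molecules, which is fine for a sketch; a rank-based implementation would be preferable for large evaluation sets.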