RetroGFN: Diverse and Feasible Retrosynthesis using GFlowNets

Piotr Gaiński, Michał Koziarski, Krzysztof Maziarz, Marwin Segler, Jacek Tabor, Marek Śmieja

TL;DR

This work tackles single-step retrosynthesis, where existing datasets cover only a small fraction of the feasible reactions for a given target and therefore give models little incentive to explore. It introduces RetroGFN, a GFlowNet-based model that is guided by a reaction feasibility proxy during training, so that it can explore beyond the training set and generate diverse, feasible reactions. Empirically, RetroGFN achieves competitive top-$k$ accuracy and superior top-$k$ round-trip accuracy on USPTO-50k and USPTO-MIT. The paper also argues that round-trip accuracy aligns better with practical synthesis planning and should be reported, demonstrates diversity advantages, and discusses drug-design implications. Future work focuses on inference improvements and more robust feasibility modeling.

Abstract

Single-step retrosynthesis aims to predict a set of reactions that lead to the creation of a target molecule, which is a crucial task in molecular discovery. Although a target molecule can often be synthesized with multiple different reactions, it is not clear how to verify the feasibility of a reaction, because the available datasets cover only a tiny fraction of the possible solutions. Consequently, existing models are not encouraged to explore the space of possible reactions sufficiently. In this paper, we propose a novel single-step retrosynthesis model, RetroGFN, that can explore outside the limited dataset and return a diverse set of feasible reactions by leveraging a feasibility proxy model during training. We show that RetroGFN achieves competitive results on standard top-$k$ accuracy while outperforming existing methods on round-trip accuracy. Moreover, we provide empirical arguments in favor of using round-trip accuracy, which expands the notion of feasibility with respect to the standard top-$k$ accuracy metric.
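
To make the contrast between the two metrics concrete, the following is a minimal sketch rather than the authors' evaluation code. It assumes that top-$k$ accuracy counts a target as a hit only when the reactant set recorded in the dataset appears among the top $k$ proposals, and that top-$k$ round-trip accuracy is the fraction of the top $k$ proposals that a forward reaction prediction model maps back to the target, averaged over targets. The function names, the toy `forward_model`, and the placeholder molecule labels (`"A"`, `"P1"`, ...) are all hypothetical.

```python
from typing import Callable, FrozenSet, Optional, Sequence

Reactants = FrozenSet[str]  # an unordered set of reactant identifiers (e.g. SMILES)


def top_k_accuracy(ranked: Sequence[Sequence[Reactants]],
                   recorded: Sequence[Reactants], k: int) -> float:
    """Fraction of targets whose recorded reactant set is among the top k proposals."""
    hits = sum(truth in preds[:k] for preds, truth in zip(ranked, recorded))
    return hits / len(recorded)


def top_k_round_trip_accuracy(products: Sequence[str],
                              ranked: Sequence[Sequence[Reactants]],
                              forward_model: Callable[[Reactants], Optional[str]],
                              k: int) -> float:
    """Mean fraction of top-k proposals that the forward model maps back to the target."""
    per_target = []
    for product, preds in zip(products, ranked):
        top = preds[:k]
        if not top:  # a model that proposes nothing scores zero for this target
            per_target.append(0.0)
            continue
        feasible = sum(forward_model(reactants) == product for reactants in top)
        per_target.append(feasible / len(top))
    return sum(per_target) / len(per_target)


# Toy stand-in for a forward reaction prediction model (assumption: a lookup table).
TOY_FORWARD = {
    frozenset({"A", "B"}): "P1",
    frozenset({"C", "D"}): "P1",  # a second route to P1 that the dataset does not record
    frozenset({"E"}): "P2",
}


def forward_model(reactants: Reactants) -> Optional[str]:
    return TOY_FORWARD.get(reactants)


products = ["P1", "P2"]
recorded = [frozenset({"A", "B"}), frozenset({"E"})]
ranked = [
    [frozenset({"C", "D"}), frozenset({"A", "B"})],
    [frozenset({"X"}), frozenset({"E"})],
]

top1 = top_k_accuracy(ranked, recorded, k=1)
round_trip1 = top_k_round_trip_accuracy(products, ranked, forward_model, k=1)
round_trip2 = top_k_round_trip_accuracy(products, ranked, forward_model, k=2)
```

In this toy setting the first proposal for `P1` is an unrecorded but feasible route, so it contributes nothing to top-1 accuracy yet counts fully toward top-1 round-trip accuracy. This is the sense in which round-trip accuracy broadens the notion of feasibility.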

Paper Structure

This paper contains 42 sections, 9 equations, 8 figures, 12 tables.

Figures (8)

  • Figure 1: Illustration of a single-step retrosynthesis (a), and a corresponding reaction template (b). Atoms from a product pattern on the left side of the template are mapped to atoms from reactant patterns on the right side (red C:i is mapped to blue C:i).
  • Figure 2: Illustration of the template composition process in RetroGFN for an input product. In the first phase, a product pattern and its concrete match to the atoms of the product are chosen. In the second phase, reactant patterns are gathered until all mappable atoms of the product pattern (highlighted red) can be mapped to mappable atoms of the reactant patterns (highlighted blue). In the third phase, the mapping between the mappable atoms of the product pattern and those of the reactant patterns is created, and the obtained template is applied to the product, yielding the reactants.
  • Figure 3: Illustration of a pattern before (left) and after (right) mapping removal. The mappable atoms of the pattern are colored blue.
  • Figure 4: A plot showing the diversity of the reactions proposed by various single-step retrosynthesis models. It shows the mean number of distinct molecular scaffolds observed in the top-k returned reactions that were predicted as feasible by a forward reaction prediction model (fine-tuned on USPTO-50k and USPTO-MIT, respectively). We observe that RetroGFN is able to return visibly more diverse reactions than other models.
  • Figure 5: Multi-step search results on the Retro* Hard target set with different single-step models. Left: The number of single-step model calls until the first solution was found (or $\emptyset$ if a molecule was not solved). The orange line represents the median, the box represents the 25th and 75th percentiles, the whiskers represent the 5th and 95th percentiles, and points outside this range are shown as dots. Right: Approximate number of non-overlapping routes present in the search graph (tracked over the number of single-step model calls). The solid line represents the median, and the shaded area shows the 40th and 60th percentiles. On the right-hand side, we note the average time taken to solve a molecule.
  • ...and 3 more figures
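
The diversity measure described in the Figure 4 caption can be sketched as follows. This is an illustrative reading of the caption, not the authors' implementation: it assumes that scaffolds are collected from the reactants of each top-$k$ reaction judged feasible by the forward model, and that the per-target counts are averaged. The callables `is_feasible` and `scaffold_of` are hypothetical placeholders. In practice `scaffold_of` could be a Bemis-Murcko scaffold computed with a cheminformatics toolkit such as RDKit, and `is_feasible` a fine-tuned forward reaction prediction model. Here both are toy lookup tables.

```python
from statistics import mean
from typing import Callable, FrozenSet, Sequence

Reactants = FrozenSet[str]


def mean_distinct_scaffolds(products: Sequence[str],
                            ranked: Sequence[Sequence[Reactants]],
                            is_feasible: Callable[[Reactants, str], bool],
                            scaffold_of: Callable[[str], str],
                            k: int) -> float:
    """Average, over targets, of the number of distinct reactant scaffolds
    found in the top-k proposed reactions that pass the feasibility check."""
    counts = []
    for product, preds in zip(products, ranked):
        scaffolds = set()
        for reactants in preds[:k]:
            if is_feasible(reactants, product):  # infeasible proposals are ignored
                scaffolds.update(scaffold_of(molecule) for molecule in reactants)
        counts.append(len(scaffolds))
    return mean(counts)


# Toy stand-ins (assumptions): a lookup-table forward model and scaffold labels.
TOY_ROUTES = {
    frozenset({"A", "B"}): "P1",
    frozenset({"C", "D"}): "P1",
    frozenset({"E"}): "P2",
}
TOY_SCAFFOLDS = {"A": "s1", "B": "s2", "C": "s1", "D": "s3", "E": "s4", "X": "s5"}


def toy_is_feasible(reactants: Reactants, product: str) -> bool:
    return TOY_ROUTES.get(reactants) == product


def toy_scaffold_of(molecule: str) -> str:
    return TOY_SCAFFOLDS[molecule]


toy_products = ["P1", "P2"]
toy_ranked = [
    [frozenset({"C", "D"}), frozenset({"A", "B"})],
    [frozenset({"X"}), frozenset({"E"})],
]

diversity_at_1 = mean_distinct_scaffolds(
    toy_products, toy_ranked, toy_is_feasible, toy_scaffold_of, k=1)
diversity_at_2 = mean_distinct_scaffolds(
    toy_products, toy_ranked, toy_is_feasible, toy_scaffold_of, k=2)
```

Because only feasible reactions contribute scaffolds, a model cannot raise this score by proposing many structurally varied but infeasible reactions.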