RetroGFN: Diverse and Feasible Retrosynthesis using GFlowNets
Piotr Gaiński, Michał Koziarski, Krzysztof Maziarz, Marwin Segler, Jacek Tabor, Marek Śmieja
TL;DR
This work tackles single-step retrosynthesis, where the available datasets cover only a small fraction of the feasible reactions for a given target and therefore give models little incentive to explore. It introduces RetroGFN, a GFlowNet-based model guided by a reaction feasibility proxy, which explores beyond the training set to generate diverse, feasible reactions. Empirically, RetroGFN achieves competitive top-k accuracy and superior top-k round-trip accuracy on USPTO-50k and USPTO-MIT. The paper argues that round-trip accuracy aligns better with practical synthesis planning than standard top-k accuracy and should be reported alongside it, demonstrates diversity advantages, and discusses implications for drug design. Future work focuses on inference improvements and more robust feasibility modeling.
Abstract
Single-step retrosynthesis aims to predict a set of reactions that lead to the creation of a target molecule, which is a crucial task in molecular discovery. Although a target molecule can often be synthesized with multiple different reactions, it is not clear how to verify the feasibility of a reaction, because the available datasets cover only a tiny fraction of the possible solutions. Consequently, existing models are not encouraged to explore the space of possible reactions sufficiently. In this paper, we propose a novel single-step retrosynthesis model, RetroGFN, that can explore outside the limited dataset and return a diverse set of feasible reactions by leveraging a feasibility proxy model during training. We show that RetroGFN achieves competitive results on standard top-k accuracy while outperforming existing methods on round-trip accuracy. Moreover, we provide empirical arguments in favor of using round-trip accuracy, which expands the notion of feasibility with respect to the standard top-k accuracy metric.
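The difference between the two metrics can be illustrated with a minimal sketch. It assumes one common reading of round-trip accuracy, which the abstract does not spell out: a predicted reaction counts as feasible if a forward reaction-prediction model maps its reactants back to the target. The function names, the lookup-table forward model, and the toy SMILES data below are all hypothetical and are not taken from the paper.

```python
from typing import Callable, Dict, FrozenSet, List, Optional

Reactants = FrozenSet[str]  # a reactant set, written as SMILES strings


def top_k_accuracy(
    predictions: Dict[str, List[Reactants]],
    ground_truth: Dict[str, Reactants],
    k: int,
) -> float:
    """Fraction of targets whose recorded reactants appear in the top-k predictions."""
    hits = sum(
        1
        for target, ranked in predictions.items()
        if ground_truth[target] in ranked[:k]
    )
    return hits / len(predictions)


def top_k_round_trip_accuracy(
    predictions: Dict[str, List[Reactants]],
    forward_model: Callable[[Reactants], Optional[str]],
    k: int,
) -> float:
    """Fraction of top-k predicted reactions that the forward model maps back to the target."""
    feasible = 0
    total = 0
    for target, ranked in predictions.items():
        for reactants in ranked[:k]:
            total += 1
            if forward_model(reactants) == target:
                feasible += 1
    return feasible / total if total else 0.0


# Toy data (assumption): one target, methyl acetate, with three ranked predictions.
TARGET = "CC(=O)OC"
ACID_ROUTE = frozenset({"CC(=O)O", "CO"})       # acetic acid + methanol
CHLORIDE_ROUTE = frozenset({"CC(=O)Cl", "CO"})  # acetyl chloride + methanol
BAD_ROUTE = frozenset({"CCO", "CO"})            # ethanol + methanol

predictions = {TARGET: [ACID_ROUTE, CHLORIDE_ROUTE, BAD_ROUTE]}
ground_truth = {TARGET: CHLORIDE_ROUTE}  # the dataset records only one route

# Stand-in for a learned forward model (assumption): a plain lookup table.
_FORWARD_TABLE = {ACID_ROUTE: TARGET, CHLORIDE_ROUTE: TARGET}


def toy_forward_model(reactants: Reactants) -> Optional[str]:
    return _FORWARD_TABLE.get(reactants)


top1 = top_k_accuracy(predictions, ground_truth, k=1)
round_trip_top1 = top_k_round_trip_accuracy(predictions, toy_forward_model, k=1)
round_trip_top3 = top_k_round_trip_accuracy(predictions, toy_forward_model, k=3)
```

In this toy case the top-ranked prediction is a plausible route that the dataset happens not to record, so it counts as a miss under top-1 accuracy but as a hit under top-1 round-trip accuracy. In practice the forward model would be a trained predictor, and products would need to be compared in a canonical form rather than as raw strings.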
