Retrosynthesis Planning via Worst-path Policy Optimisation in Tree-structured MDPs
Mianchu Wang, Giovanni Montana
TL;DR
This work reframes retrosynthesis as worst-path optimisation on tree-structured MDPs, on the grounds that a route is viable only if every leaf is a purchasable building block. It introduces InterRetro, which learns a value function for worst-path outcomes and uses weighted self-imitation to iteratively improve a constrained single-step policy. The formulation is shown to admit a unique optimal value function $V^*$ and to guarantee monotonic improvement. The learned policy plans routes without search at inference time and achieves state-of-the-art results: 100% success on Retro*-190, routes that are 4.9% shorter, and promising performance with only 10% of the training data. The approach may extend to other settings where system reliability is determined by the weakest component. Integrating reaction feasibility and cost is left to future work.
Abstract
Retrosynthesis planning aims to decompose target molecules into available building blocks, forming a synthetic tree where each internal node represents an intermediate compound and each leaf ideally corresponds to a purchasable reactant. However, this tree becomes invalid if any leaf node is not a valid building block, making the planning process vulnerable to the "weakest link" in the synthetic route. Existing methods often optimise for average performance across branches, failing to account for this worst-case sensitivity. In this paper, we reframe retrosynthesis as a worst-path optimisation problem within tree-structured Markov Decision Processes (MDPs). We prove that this formulation admits a unique optimal solution and provides monotonic improvement guarantees. Building on this insight, we introduce Interactive Retrosynthesis Planning (InterRetro), a method that interacts with the tree MDP, learns a value function for worst-path outcomes, and improves its policy through self-imitation, preferentially reinforcing past decisions with high estimated advantage. Empirically, InterRetro achieves state-of-the-art results: it solves 100% of targets on the Retro*-190 benchmark, shortens synthetic routes by 4.9%, and achieves promising performance using only 10% of the training data.
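The contrast between worst-path and average scoring can be illustrated with a minimal Python sketch. This is not the InterRetro implementation. The `Node` class, the 0/1 leaf reward, the discount `gamma`, and the function names are all assumptions introduced here for illustration. Each root-to-leaf path is scored as `gamma ** depth` if the leaf is in a stock of building blocks and 0 otherwise. The worst-path score is the minimum over leaves, whereas an average-based score can stay high even when one leaf is unavailable.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Node:
    """A molecule in a synthetic tree; a node without children is a leaf."""
    molecule: str
    children: List["Node"] = field(default_factory=list)


def leaf_scores(node: Node, stock: Set[str], gamma: float = 0.9, depth: int = 0) -> List[float]:
    """Return one discounted score per root-to-leaf path (assumed 0/1 leaf reward)."""
    if not node.children:
        reward = 1.0 if node.molecule in stock else 0.0
        return [gamma ** depth * reward]
    scores: List[float] = []
    for child in node.children:
        scores.extend(leaf_scores(child, stock, gamma, depth + 1))
    return scores


def worst_path_score(root: Node, stock: Set[str], gamma: float = 0.9) -> float:
    """Score a route by its weakest path: a single bad leaf drives this to 0."""
    return min(leaf_scores(root, stock, gamma))


def average_path_score(root: Node, stock: Set[str], gamma: float = 0.9) -> float:
    """Score a route by the mean over paths, which can mask a bad leaf."""
    scores = leaf_scores(root, stock, gamma)
    return sum(scores) / len(scores)


# Hypothetical stock and routes; the molecule names are placeholders.
stock = {"A", "B", "C"}
broken_route = Node("target", [
    Node("int1", [Node("A"), Node("B")]),
    Node("int2", [Node("C"), Node("X")]),  # "X" is not purchasable
])
valid_route = Node("target", [
    Node("int1", [Node("A"), Node("B")]),
    Node("int2", [Node("C"), Node("A")]),
])

broken_worst = worst_path_score(broken_route, stock)
broken_avg = average_path_score(broken_route, stock)
valid_worst = worst_path_score(valid_route, stock)
```

In this toy setup the broken route has a worst-path score of zero while its average score remains well above zero, which is the "weakest link" sensitivity the abstract describes. Under the assumed discount, shallower valid routes also score higher, so the same quantity rewards shorter routes.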
