Retrosynthesis Planning via Worst-path Policy Optimisation in Tree-structured MDPs

Mianchu Wang, Giovanni Montana

TL;DR

This work reframes retrosynthesis as a worst-path optimisation problem on tree-structured MDPs, arguing that a route is only viable if every leaf is a purchasable building block. It introduces InterRetro, which learns a value function for worst-path outcomes and uses weighted self-imitation to iteratively improve a constrained single-step policy. The formulation admits a unique optimal value function $V^*$ and guarantees monotonic improvement. The method enables close-to-inference-time planning without search and achieves state-of-the-art results: it solves 100% of targets on Retro*-190, shortens synthetic routes by 4.9%, and reaches promising performance with only 10% of the training data. The approach may extend to other settings where system reliability is determined by the weakest component, and the authors identify reaction feasibility and cost integration as future work. Overall, InterRetro provides a principled, search-free framework for robust multi-step planning in complex, tree-structured decision problems.

Abstract

Retrosynthesis planning aims to decompose target molecules into available building blocks, forming a synthetic tree where each internal node represents an intermediate compound and each leaf ideally corresponds to a purchasable reactant. However, this tree becomes invalid if any leaf node is not a valid building block, making the planning process vulnerable to the "weakest link" in the synthetic route. Existing methods often optimise for average performance across branches, failing to account for this worst-case sensitivity. In this paper, we reframe retrosynthesis as a worst-path optimisation problem within tree-structured Markov Decision Processes (MDPs). We prove that this formulation admits a unique optimal solution and provides monotonic improvement guarantees. Building on this insight, we introduce Interactive Retrosynthesis Planning (InterRetro), a method that interacts with the tree MDP, learns a value function for worst-path outcomes, and improves its policy through self-imitation, preferentially reinforcing past decisions with high estimated advantage. Empirically, InterRetro achieves state-of-the-art results: it solves 100% of targets on the Retro*-190 benchmark, shortens synthetic routes by 4.9%, and achieves promising performance using only 10% of the training data.

Paper Structure

This paper contains 22 sections, 5 theorems, 44 equations, 5 figures, 2 tables.

Key Result

Proposition 1

The Q-function $Q^\pi(s, a)$ equals its immediate reward plus the discounted value of its next states. The proof is provided in the appendix.

Figures (5)

  • Figure 1: Single-step prediction decomposes a molecule into reactants, whereas multi-step planning searches for a synthetic route, aiming to reach purchasable building blocks.
  • Figure 2: Distribution of reactant counts in the USPTO-50k dataset.
  • Figure 3: Examples illustrating the tree MDP formulation. Each non-leaf node represents a molecule that is decomposed into one or more reactants. Left tree: A successful synthetic route for target molecule $\mathrm{A}$. It contains $4$ root-to-leaf paths: $P(\tau)=\{\mathrm{ABD}, \mathrm{ABEF}, \mathrm{ABEGH}, \mathrm{AC}\}$. Since all leaf nodes are building blocks, each path receives a value of $\gamma^T$, where $T$ is the path length. The tree's overall value is $\min_{p \in P(\tau)} \{\gamma^2, \gamma^3, \gamma^4, \gamma\}=\gamma^4$, determined by the longest path. Right tree: A failed synthesis attempt for molecule $\mathrm{A}$. One of its paths, $\mathrm{ABEG}$, terminates at $\mathrm{G}$, which is not a building block. This gives path $\mathrm{ABEG}$ a value of $0$, making the tree's overall value $\min_{p \in P(\tau)} \{\gamma^2, \gamma^3, 0, \gamma\}=0$, illustrating why a single failing path invalidates the entire route.
  • Figure 4: Experimental figures. (a) Performance under different training data usage and computation budgets. (b) Statistics on estimated depth of synthetic trees. (c) Ablations on the advantage coefficient.
  • Figure 5: Predicted synthetic routes on three randomly selected molecules in Retro*-190. The yellow circles highlight the reaction centres.
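
The worst-path value described in the Figure 3 caption can be sketched in a few lines of Python. This is only an illustration of the caption's arithmetic, not the authors' implementation: the dictionary encoding of the tree, the function name `worst_path_value`, the sets of building blocks, and the discount factor `GAMMA = 0.9` are all assumptions made for the example.

```python
from typing import Dict, List, Set

# Hypothetical encoding: molecule -> reactants it is decomposed into.
Tree = Dict[str, List[str]]


def worst_path_value(tree: Tree, node: str, building_blocks: Set[str],
                     gamma: float, depth: int = 0) -> float:
    """Return the minimum path value over all root-to-leaf paths below `node`."""
    children = tree.get(node, [])
    if not children:
        # A path of length T is worth gamma^T if its leaf is a building block,
        # and 0 otherwise (as stated in the Figure 3 caption).
        return gamma ** depth if node in building_blocks else 0.0
    # The tree's value is set by its worst root-to-leaf path.
    return min(worst_path_value(tree, child, building_blocks, gamma, depth + 1)
               for child in children)


GAMMA = 0.9  # assumed discount factor, chosen only for illustration

# Left tree of Figure 3: paths ABD, ABEF, ABEGH, AC; every leaf is a building block.
left_tree: Tree = {"A": ["B", "C"], "B": ["D", "E"], "E": ["F", "G"], "G": ["H"]}
left_value = worst_path_value(left_tree, "A", {"C", "D", "F", "H"}, GAMMA)

# Right tree of Figure 3: path ABEG ends at G, which is not a building block.
right_tree: Tree = {"A": ["B", "C"], "B": ["D", "E"], "E": ["F", "G"]}
right_value = worst_path_value(right_tree, "A", {"C", "D", "F"}, GAMMA)

print(left_value, right_value)
```

For any discount factor strictly between 0 and 1, the left tree evaluates to $\gamma^4$ (its longest path) and the right tree to $0$, matching the two cases in the caption.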

Theorems & Definitions (9)

  • Proposition 1
  • Proposition 2
  • Proposition 3
  • Proposition 4
  • proof
  • proof
  • Proposition 5
  • proof
  • proof