Procrustean Bed for AI-Driven Retrosynthesis: A Unified Framework for Reproducible Evaluation

Anton Morgunov, Victor S. Batista

TL;DR

The paper introduces RetroCast, an open-source evaluation suite that standardizes disparate retrosynthesis outputs into a common schema and pairs it with SynthArena for qualitative route inspection. It demonstrates that traditional Stock-Termination Rate can misrepresent chemical validity and shows, via Multi-Ground-Truth evaluation, that architectural differences emerge between search-based and sequence-based approaches. By stratifying benchmarks and incorporating a cost-performance frontier, the work reveals a complexity cliff where long-range planning exposes weaknesses in search-based methods and underscores the need for plausibility-focused metrics. The authors provide extensive, reproducible benchmarks and a public data/leaderboard ecosystem to shift the field toward chemistry-aware evaluation and transparent, community-driven progress.

Abstract

Progress in computer-aided synthesis planning (CASP) is obscured by the lack of standardized evaluation infrastructure and the reliance on metrics that prioritize topological completion over chemical validity. We introduce RetroCast, a unified evaluation suite that standardizes heterogeneous model outputs into a common schema to enable statistically rigorous, apples-to-apples comparison. The framework includes a reproducible benchmarking pipeline with stratified sampling and bootstrapped confidence intervals, accompanied by SynthArena, an interactive platform for qualitative route inspection. We utilize this infrastructure to evaluate leading search-based and sequence-based algorithms on a new suite of standardized benchmarks. Our analysis reveals a divergence between "solvability" (stock-termination rate) and route quality; high solvability scores often mask chemical invalidity or fail to correlate with the reproduction of experimental ground truths. Furthermore, we identify a "complexity cliff" in which search-based methods, despite high solvability rates, exhibit a sharp performance decay in reconstructing long-range synthetic plans compared to sequence-based approaches. We release the full framework, benchmark definitions, and a standardized database of model predictions to support transparent and reproducible development in the field.
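The abstract mentions bootstrapped confidence intervals without giving the procedure. As a rough illustration only, the sketch below computes a percentile bootstrap interval for a benchmark-level accuracy from per-target outcomes. The function name `bootstrap_ci`, its parameters, and the choice of the percentile method are assumptions made for this example; they are not taken from RetroCast.

```python
import random


def bootstrap_ci(outcomes, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-target outcomes.

    `outcomes` holds one number per benchmark target, e.g. 1 if the
    model matched the reference route within its top-k and 0 otherwise.
    Hypothetical helper for illustration, not the RetroCast API.
    """
    if not outcomes:
        raise ValueError("outcomes must be non-empty")
    rng = random.Random(seed)  # fixed seed keeps the interval reproducible
    n = len(outcomes)
    means = []
    for _ in range(n_resamples):
        # Resample targets with replacement and record the resampled mean.
        sample = rng.choices(outcomes, k=n)
        means.append(sum(sample) / n)
    means.sort()
    # Read the interval bounds off the sorted resampled means.
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(outcomes) / n, lower, upper


if __name__ == "__main__":
    # Made-up outcomes: 60 hits and 40 misses on a 100-target benchmark.
    mean, lower, upper = bootstrap_ci([1] * 60 + [0] * 40)
    print(f"accuracy={mean:.2f}, 95% CI=({lower:.2f}, {upper:.2f})")
```

Resampling whole targets, rather than individual routes, keeps each target's outcome intact, which is the usual choice when the reported metric is a per-target average.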

Paper Structure

This paper contains 22 sections, 8 figures, 11 tables.

Figures (8)

  • Figure 1: The babel of retrosynthesis formats. Illustration of five fundamentally incompatible output formats for a single reference synthetic route (A). Placeholders correspond to the target (T), an intermediate (I), and purchasable leaves (L1, L2, L3). Formats range from verbose, explicit graph structures (B, D, E) to concise, implicit string-based representations (C, F). (B) a simpler nested json of only molecule nodes, where reactions are implicit. (C) a declarative string mapping products to reactants. (D) a nested json where molecule nodes alternate with explicit reaction nodes. (E) a schema where a route is defined as a list of edges that reference a separate map of nodes. (F) a linear "recipe" string where the product of one step becomes an implicit reactant in the next. This heterogeneity necessitates a translation layer like RetroCast for any comparative analysis.
  • Figure 2: High stock-termination rate rewards chemically invalid routes. Analysis reveals how metrics blind to chemical validity can mislead. (A-C) The case of target USPTO-082. A, A "solved" route from the top-performing model hinges on a chemically implausible seven-reactant step. B, The official reference route contains the identical flawed transformation, showing the model's success is an artifact of pattern-matching corrupted data. C, A route from a newer model that avoids the nonsensical step is penalized for failing to find a "solved" path. (D-H) A catalog of chemical hallucinations from other "solved" routes, demonstrating the systemic nature of the issue. Violations include: D, mass balance error (un-sourced chloro-phenyl group); E, implausible transformation (tartaric acid as a propargyl source); F, mass balance error (un-sourced pyridylmethylamine); G, implausible reaction (amino acids to tryptophan core); H, unspecified reagent (epoxidation with carbonic acid). These cases show that naive stock termination rate fails to capture fundamental chemical principles. 
Interactive versions are on SynthArena: https://syntharena.ischemist.com/benchmarks/cmisbzsr30000xvdd613ymmbx/targets/cmisbzt2900y4xvddbnu3q2k5?mode=pred-vs-pred&model1=cmise2ax00000qsddkfge5au3&rank1=1&model2=cmisdw7p10000ceddz6l01zhq&rank2=1, https://syntharena.ischemist.com/runs/cmise2ax00000qsddkfge5au3?stock=qhi67k3yqgqhrx49sc3akbih&target=cmisbzt5j01bwxvddy4a5xpu2&rank=1&search=114, https://syntharena.ischemist.com/runs/cmise2ax00000qsddkfge5au3?stock=qhi67k3yqgqhrx49sc3akbih&target=cmisbztag020hxvdd8nl7zg94&rank=1&search=169, https://syntharena.ischemist.com/runs/cmise2ax00000qsddkfge5au3?stock=qhi67k3yqgqhrx49sc3akbih&target=cmisbzt3h0139xvddt5rm50se&rank=1&search=93, https://syntharena.ischemist.com/runs/cmise2ax00000qsddkfge5au3?stock=qhi67k3yqgqhrx49sc3akbih&target=cmisbzsv10066xvddmu0bi5nk&rank=1&search=16, https://syntharena.ischemist.com/runs/cmise2ax00000qsddkfge5au3?stock=qhi67k3yqgqhrx49sc3akbih&target=cmisbztbj025gxvddwrx3reh6&rank=1&search=181.
  • Figure 3: The Economic Trade-offs of Synthesis Planning. Pareto plot of Top-10 route-matching accuracy versus computational cost (USD) on the mkt-cnv-160 benchmark. The efficient frontier (dashed line) illustrates the apparent optimal trade-off, but this landscape is defined by the field's measurement limitations: models in the low-cost region can be predicated on chemically implausible steps, while the accuracy metric itself penalizes the discovery of valid novel routes. Error bars represent 95% bootstrapped CIs.
  • Figure S1: The Cost of Accuracy on Linear Routes. A Pareto plot of Top-10 Accuracy versus Total Cost (USD) on the mkt-lin-500 benchmark. The analysis confirms the trade-off structure seen in Figure 3 is generalizable to linear routes. While absolute costs differ due to the larger benchmark size, the relative cost-performance profiles and the shape of the efficient frontier are consistent, demonstrating a robust relationship between accuracy and computational cost. Error bars are 95% bootstrapped CIs.
  • Figure S2: Selection of a statistically representative benchmark seed for mkt-cnv-160. The plot shows the performance variation of a reference model (DMS Explorer XL) across 15 candidate benchmarks, each generated with a different random seed. Points indicate the mean accuracy (Solvability, Top-1, and Top-10), with horizontal bars representing the bootstrapped 95% confidence intervals. Dashed vertical lines mark the grand mean performance across all seeds. This stability analysis allows us to quantify the variance introduced by subset sampling and select a seed (in this case, 20180329) that yields a benchmark whose metrics are demonstrably close to the central tendency, ensuring our evaluations are robust against sampling artifacts.
  • ...and 3 more figures
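
The efficient frontier described in the captions of Figure 3 and Figure S1 can be illustrated with a short sketch. It assumes each model is summarized by a total cost (lower is better) and a Top-10 accuracy (higher is better), and it keeps the models that no other model beats on both axes. The function name `pareto_frontier` and the example numbers are hypothetical; they are not results from the paper.

```python
def pareto_frontier(points):
    """Return the names of models on the cost-accuracy efficient frontier.

    `points` is a list of (name, cost, accuracy) tuples. A model is kept
    if no other model is at least as cheap and strictly more accurate.
    Hypothetical helper for illustration, not the RetroCast API.
    """
    # Sort by ascending cost; for equal cost, put higher accuracy first.
    ordered = sorted(points, key=lambda p: (p[1], -p[2]))
    frontier = []
    best_accuracy = float("-inf")
    for name, cost, accuracy in ordered:
        # Keep a model only if it improves on every cheaper model so far.
        if accuracy > best_accuracy:
            frontier.append(name)
            best_accuracy = accuracy
    return frontier


if __name__ == "__main__":
    # Made-up (name, cost in USD, Top-10 accuracy) values.
    models = [
        ("model-a", 1.0, 0.20),
        ("model-b", 2.0, 0.10),
        ("model-c", 3.0, 0.50),
        ("model-d", 3.0, 0.40),
    ]
    print(pareto_frontier(models))
```

This sketch compares point estimates only. Because the captions report 95% bootstrapped CIs, models with overlapping intervals may not be distinguishable even when one nominally dominates the other.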