Trustworthy Retrosynthesis: Eliminating Hallucinations with a Diverse Ensemble of Reaction Scorers

Michal Sadowski, Tadija Radusinović, Maria Wyrzykowska, Lukasz Sztukiewicz, Jan Rzymkowski, Paweł Włodarczyk-Pruszyński, Mikołaj Sacha, Piotr Kozakowski, Ruard van Workum, Stanislaw Kamil Jastrzebski

TL;DR

This work tackles hallucinations in automated retrosynthesis by introducing RetroTrim, a system that pairs a single-step retrosynthesis (SSR) generator with a MetaScorer, an ensemble of three complementary plausibility scorers (Reaction Prior, Reaction Graph Plausibility, and Reference Reaction Scorer). The generator is RootAligned, the search algorithm is Retro*, and the system is evaluated with a novel expert-reviewed reaction labeling protocol on 32 challenging drug-like targets. Among the systems compared, RetroTrim is the only one that filters out hallucinated reactions, and it also yields the highest number of high-quality pathways; the approach formed the basis of the winning solution to the Standard Industries $1 million Retrosynthesis Challenge. The authors also release the target set and evaluation protocol to advance rigorous, human-centric assessment of retrosynthetic plausibility.

Abstract

Retrosynthesis is one of the domains transformed by the rise of generative models, and it is one where the problem of nonsensical or erroneous outputs (hallucinations) is particularly insidious: reliable assessment of synthetic plans is time-consuming, and automatic methods are lacking. In this work, we present RetroTrim, a retrosynthesis system that successfully avoids nonsensical plans on a set of challenging drug-like targets. Compared to common baselines in the field, our system is not only the sole method that succeeds in filtering out hallucinated reactions, but it also results in the highest number of high-quality paths overall. The key insight behind RetroTrim is the combination of diverse reaction scoring strategies, based on machine learning models and existing chemical databases. We show that our scoring strategies capture different classes of hallucinations by analyzing them on a dataset of labeled retrosynthetic intermediates. This approach formed the basis of our winning solution to the Standard Industries $1 million Retrosynthesis Challenge. To measure the performance of retrosynthesis systems, we propose a novel evaluation protocol for reactions and synthetic paths based on a structured review by expert chemists. Using this protocol, we compare systems on a set of 32 novel targets, curated to reflect recent trends in drug structures. While the insights behind our methodology are broadly applicable to retrosynthesis, our focus is on targets in the drug-like domain. By releasing our benchmark targets and the details of our evaluation protocol, we hope to inspire further research into reliable retrosynthesis.
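
To make the idea of combining diverse scorers concrete, here is a minimal Python sketch. The abstract does not say how the MetaScorer merges its inputs, so the weighted average, the threshold, the class layout, and the toy constant scorers below are all assumptions for illustration, not the authors' method.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical scorer type: maps a reaction string to a plausibility in [0, 1].
Scorer = Callable[[str], float]


@dataclass
class MetaScorer:
    """Illustrative ensemble: weighted average of individual scorer outputs.

    The combination rule is an assumption; the source only states that
    several diverse scorers are combined.
    """
    scorers: Dict[str, Scorer]
    weights: Dict[str, float]
    threshold: float = 0.5

    def score(self, reaction: str) -> float:
        total_weight = sum(self.weights[name] for name in self.scorers)
        weighted = sum(
            self.weights[name] * fn(reaction) for name, fn in self.scorers.items()
        )
        return weighted / total_weight

    def is_plausible(self, reaction: str) -> bool:
        return self.score(reaction) >= self.threshold


# Toy stand-ins for the three scorers named in the TL;DR (constant outputs).
toy_scorers = {
    "reaction_prior": lambda rxn: 0.9,
    "reaction_graph_plausibility": lambda rxn: 0.6,
    "reference_reaction_scorer": lambda rxn: 0.3,
}
toy_weights = {
    "reaction_prior": 2.0,
    "reaction_graph_plausibility": 1.0,
    "reference_reaction_scorer": 1.0,
}

meta = MetaScorer(scorers=toy_scorers, weights=toy_weights)
combined = meta.score("A.B>>C")  # placeholder reaction string
```

In a real system each scorer would be a trained model or a database lookup, and the combination rule could itself be learned from labeled reactions.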

Paper Structure

This paper contains 49 sections, 3 equations, 35 figures.

Figures (35)

  • Figure 1: An example of a grossly incorrect (hallucinated) reaction generated by a Single-Step Retrosynthesis model. A PhD-level chemist recognizes that the only reasonable atom mapping between the substrates and the product is one where the reaction center is an ortho-amino benzoate converting into a triazole (highlighted in yellow). It does not belong to any commonly known reaction class, and further investigation involving extensive searches of synthetic databases yields no examples that would inform what reagents and conditions could induce such a reaction. Executing this transformation would be impractical and require the development of a novel synthetic methodology, which typically entails a multi-month research program.
  • Figure 2: Visualization of RetroTrim (above) and retrosynthetic search with plausibility filtering (below). RetroTrim encompasses a generator, which proposes precursor molecules for a given target, and a scorer, which evaluates the plausibility of the generated reaction. In the search process, plausible precursors are expanded further, until we arrive at commercially available starting materials. Implausible reactions terminate the search branch. The search concludes when a complete synthetic route from commercially available starting materials to the target molecule is found.
  • Figure 3: Comparison of our retrosynthesis generator (RootAligned) with different scorers against IBM RXN, AiZynthFinder, LocalRetro, and RetroChimera. Among these four baselines, RetroChimera performs significantly better than the others, but it still fails on more than 25% of targets and produces a significant number of hallucinations. RootAligned without any reaction scorer finds pathways for all targets but includes unreliable routes. Introducing individual scorers trades coverage for reliability, with Reaction Prior (RP) eliminating all Nonsense pathways. RetroTrim, backed by the MetaScorer, produces the most trustworthy results.
  • Figure 4: ROC (left) and precision-recall (right) curves comparing the performance of individual scorers versus the MetaScorer. The MetaScorer achieves higher AUC values for both ROC and PR curves, indicating better discrimination between plausible and implausible reactions. Among the individual scorers, RP shows the best performance.
  • Figure 5: ROC-AUC performance of individual scorers across different failure categories, with sample sizes indicated for each category.
  • ...and 30 more figures
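
The search with plausibility filtering described in the Figure 2 caption can be sketched roughly as follows. This is a simplified depth-first search, not Retro* (the search the paper names), and the function names, the threshold, the depth limit, and the toy molecules are all hypothetical.

```python
from typing import Callable, Dict, List, Optional, Set

# Hypothetical interfaces: a generator proposes precursor sets for a molecule,
# and a scorer rates the reaction "precursors -> product" in [0, 1].
Generator = Callable[[str], List[List[str]]]
ReactionScorer = Callable[[str, List[str]], float]


def find_route(
    target: str,
    generate: Generator,
    score: ReactionScorer,
    stock: Set[str],
    threshold: float = 0.5,
    max_depth: int = 5,
) -> Optional[Dict[str, List[str]]]:
    """Return a mapping product -> precursors covering a full route, or None."""
    if target in stock:
        return {}  # commercially available: nothing to synthesize
    if max_depth == 0:
        return None
    for precursors in generate(target):
        if score(target, precursors) < threshold:
            continue  # implausible reaction: this branch is terminated
        route = {target: precursors}
        for molecule in precursors:
            sub_route = find_route(
                molecule, generate, score, stock, threshold, max_depth - 1
            )
            if sub_route is None:
                break  # one precursor is unreachable, try the next reaction
            route.update(sub_route)
        else:
            return route  # every precursor traced back to the stock
    return None


# Toy data: letters stand in for molecules.
toy_reactions = {"T": [["X"], ["A", "B"]], "A": [["C"]]}
toy_stock = {"B", "C"}


def toy_generate(molecule: str) -> List[List[str]]:
    return toy_reactions.get(molecule, [])


def toy_score(product: str, precursors: List[str]) -> float:
    # Pretend the reaction T <- X is a hallucination.
    return 0.1 if precursors == ["X"] else 0.9


route = find_route("T", toy_generate, toy_score, toy_stock)
```

Here the low-scoring step from X is skipped, and the returned route builds T from A and B, with A in turn made from C.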