Retrosynthesis prediction enhanced by in-silico reaction data augmentation

Xu Zhang, Yiming Mo, Wenguan Wang, Yi Yang

TL;DR

The paper tackles the data scarcity bottleneck in ML-based retrosynthesis by introducing RetroWISE, a self-boosting framework that generates in-silico reactions from unpaired data using a base model trained on real paired data, and augments real data to train a superior retrosynthesis predictor. Through template-based filtering and molecular similarity checks, RetroWISE improves both accuracy and diversity of predictions, achieving state-of-the-art results on USPTO benchmarks and especially enhancing rare transformations. The approach leverages a Transformer-based architecture with forward and retrosynthesis components, and demonstrates favorable scalability and efficiency, including substantial gains on the largest datasets. Overall, RetroWISE offers a cost-effective data augmentation strategy that mitigates the need for expansive proprietary reaction databases, accelerating ML-driven retrosynthesis research and application.

Abstract

Recent advances in machine learning (ML) have expedited retrosynthesis research by helping chemists design experiments more efficiently. However, all ML-based methods consume substantial amounts of paired training data (i.e., chemical reactions as product-reactant(s) pairs), which is costly to obtain. Moreover, companies view reaction data as a valuable asset and restrict researchers' access to it. These issues prevent the creation of more powerful retrosynthesis models due to their data-driven nature. In response, we exploit easy-to-access unpaired data (i.e., one component of a product-reactant(s) pair) to generate in-silico paired data that facilitates model training. Specifically, we present RetroWISE, a self-boosting framework that employs a base model inferred from real paired data to perform in-silico reaction generation and augmentation using unpaired data, ultimately leading to a superior model. On three benchmark datasets, RetroWISE achieves the best overall performance against state-of-the-art models (e.g., +8.6% top-1 accuracy on the USPTO-50K test dataset). Moreover, it consistently improves the prediction accuracy of rare transformations. These results show that RetroWISE overcomes the training bottleneck by using in-silico reactions, thereby paving the way toward more effective ML-based retrosynthesis models.
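
The generate, filter, and augment loop described above can be sketched in a few lines of Python. This is only an illustrative outline, not the authors' implementation: `forward_model` and `matches_template` are hypothetical callables supplied by the caller, the character n-gram Tanimoto score is a toy stand-in for a real fingerprint-based molecular similarity, and both the threshold value and the choice to compare product against reactants are assumptions.

```python
from typing import Callable, Iterable, List, Tuple

# A reaction is stored as (product SMILES, reactants SMILES).
Reaction = Tuple[str, str]


def ngram_set(smiles: str, n: int = 2) -> set:
    """Character n-grams of a SMILES string (toy stand-in for a fingerprint)."""
    return {smiles[i:i + n] for i in range(max(len(smiles) - n + 1, 1))}


def tanimoto(a: str, b: str) -> float:
    """Tanimoto (Jaccard) similarity between two n-gram sets."""
    sa, sb = ngram_set(a), ngram_set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def generate_in_silico(
    unpaired_reactants: Iterable[str],
    forward_model: Callable[[str], str],           # hypothetical: reactants -> product
    matches_template: Callable[[str, str], bool],  # hypothetical template check
    sim_threshold: float = 0.3,                    # assumed value, not from the paper
) -> List[Reaction]:
    """Predict products for unpaired reactants and keep only filtered pairs."""
    kept: List[Reaction] = []
    for reactants in unpaired_reactants:
        product = forward_model(reactants)
        if not product:
            continue  # the base model produced nothing usable
        if not matches_template(product, reactants):
            continue  # template-matching filter
        if tanimoto(product, reactants) < sim_threshold:
            continue  # similarity filter (direction of the test is an assumption)
        kept.append((product, reactants))
    return kept


def augment(real: Iterable[Reaction], in_silico: Iterable[Reaction]) -> List[Reaction]:
    """Combine real and in-silico reactions into one paired training set."""
    return list(real) + list(in_silico)
```

The combined list returned by `augment` would then serve as training data for the retrosynthesis model. Symmetrically, unpaired products could be passed through a base retrosynthesis model, which this sketch omits.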

Paper Structure

This paper contains 5 sections, 1 equation, 8 figures, and 6 tables.

Figures (8)

  • Figure 1: Overview of the RetroWISE framework. a, Given the unpaired reactants $\hat{Y}^\circ$ as an example, the base forward synthesis model $g_{y x}$ trained on real paired data is used to generate in-silico products $\hat{X}^\circ$. Then, a filter process consisting of template matching and molecular similarity comparison selects high-quality in-silico reactions $\hat{R}^\circ$. b, These cheap in-silico reactions are used to augment costly real reactions as paired training data to train a more effective retrosynthesis model $\hat{f}_{x y}$. In this way, the whole framework is self-boosted.
  • Figure 2: Impact of data quantity. a, Impact of in-silico data quantity. b, Impact of real data quantity. Training with more in-silico data or more real data improves performance. The in-silico data ratio is the number of in-silico reactions divided by the number of real reactions, and vice versa for the real data ratio.
  • Figure 3: Top-5 accuracy for different reaction types. RetroWISE outperforms the baseline on almost every reaction type.
  • Figure 4: Performance on rare transformations. a, Top-k exact match accuracy. b, Top-k MaxFrag match accuracy. RetroWISE achieves consistent improvements on three testing benchmarks of rare transformations.
  • Figure 5: Representative examples of Rare2 predictions. The green part highlights the structure corresponding to the template. RetroWISE produces more accurate predictions than Baseline on rare transformations.
  • ...and 3 more figures
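
Two quantities used in the captions above, the in-silico data ratio (Figure 2) and top-k exact match accuracy (Figure 4a), can be written out as a minimal sketch. The function names are hypothetical, and exact match is assumed here to mean string equality between SMILES that have already been canonicalized; the paper's evaluation code may differ.

```python
from typing import Sequence


def in_silico_ratio(n_in_silico: int, n_real: int) -> float:
    """Number of in-silico reactions divided by the number of real reactions."""
    if n_real <= 0:
        raise ValueError("n_real must be positive")
    return n_in_silico / n_real


def top_k_accuracy(
    predictions: Sequence[Sequence[str]],  # ranked candidate reactants per product
    targets: Sequence[str],                # ground-truth reactants per product
    k: int,
) -> float:
    """Fraction of test cases whose target appears among the first k candidates.

    Assumes predictions and targets are canonical SMILES strings, so that
    exact match reduces to string equality.
    """
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        return 0.0
    hits = sum(1 for preds, t in zip(predictions, targets) if t in preds[:k])
    return hits / len(targets)
```

For example, with 100 in-silico reactions and 50 real ones, `in_silico_ratio(100, 50)` gives a ratio of 2.0.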