Retrosynthesis prediction enhanced by in-silico reaction data augmentation
Xu Zhang, Yiming Mo, Wenguan Wang, Yi Yang
TL;DR
The paper tackles the data scarcity bottleneck in ML-based retrosynthesis by introducing RetroWISE, a self-boosting framework that generates in-silico reactions from unpaired data using a base model trained on real paired data, and augments real data to train a superior retrosynthesis predictor. Through template-based filtering and molecular similarity checks, RetroWISE improves both accuracy and diversity of predictions, achieving state-of-the-art results on USPTO benchmarks and especially enhancing rare transformations. The approach leverages a Transformer-based architecture with forward and retrosynthesis components, and demonstrates favorable scalability and efficiency, including substantial gains on the largest datasets. Overall, RetroWISE offers a cost-effective data augmentation strategy that mitigates the need for expansive proprietary reaction databases, accelerating ML-driven retrosynthesis research and application.
Abstract
Recent advances in machine learning (ML) have expedited retrosynthesis research by helping chemists design experiments more efficiently. However, all ML-based methods consume substantial amounts of paired training data (i.e., chemical reactions given as product-reactant(s) pairs), which are costly to obtain. Moreover, companies view reaction data as a valuable asset and restrict researchers' access to it. Because retrosynthesis models are data-driven, these issues prevent the creation of more powerful models. In response, we exploit easy-to-access unpaired data (i.e., one component of a product-reactant(s) pair) to generate in-silico paired data that facilitates model training. Specifically, we present RetroWISE, a self-boosting framework that employs a base model inferred from real paired data to perform in-silico reaction generation and augmentation using unpaired data, ultimately leading to a superior model. On three benchmark datasets, RetroWISE achieves the best overall performance against state-of-the-art models (e.g., +8.6% top-1 accuracy on the USPTO-50K test dataset). Moreover, it consistently improves the prediction accuracy of rare transformations. These results show that RetroWISE overcomes the training bottleneck with in-silico reactions, thereby paving the way toward more effective ML-based retrosynthesis models.
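
The self-boosting loop described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `augment_with_in_silico`, the `train_forward` and `keep` callables, and the toy model are all hypothetical stand-ins, and the sketch assumes the unpaired data are reactants that a base forward model maps to products. The toy model merely deletes the "." separator and carries no chemical meaning.

```python
from typing import Callable, Iterable, List, Tuple

Pair = Tuple[str, str]        # (product, reactants), e.g. SMILES strings
Model = Callable[[str], str]  # maps reactants to a predicted product


def augment_with_in_silico(
    real_pairs: List[Pair],
    unpaired_reactants: Iterable[str],
    train_forward: Callable[[List[Pair]], Model],
    keep: Callable[[Pair], bool],
) -> List[Pair]:
    """Return the real pairs followed by the filtered in-silico pairs."""
    # Step 1: infer a base model from the real paired data.
    base_model = train_forward(real_pairs)
    # Step 2: pair every unpaired reactant set with a generated product.
    generated = [(base_model(r), r) for r in unpaired_reactants]
    # Step 3: discard generated pairs that fail the quality filter.
    in_silico = [pair for pair in generated if keep(pair)]
    # Step 4: the augmented set is what the final model would be trained on.
    return real_pairs + in_silico


# --- Toy stand-ins (assumptions for illustration only) ---
def toy_train_forward(pairs: List[Pair]) -> Model:
    # Placeholder "model": ignores the training pairs and removes the
    # reactant separator. A real base model would be a trained network.
    return lambda reactants: reactants.replace(".", "")


def toy_keep(pair: Pair) -> bool:
    # Placeholder filter: keep non-empty products that differ from the input.
    product, reactants = pair
    return bool(product) and product != reactants


real = [("CCOC(C)=O", "CCO.CC(=O)O")]
unpaired = ["CN.CC(=O)Cl", "CCO"]
augmented = augment_with_in_silico(real, unpaired, toy_train_forward, toy_keep)
```

In this toy run the second unpaired entry is dropped because the placeholder model returns it unchanged, so `augmented` holds the one real pair plus one in-silico pair. In a realistic setting, `keep` is where checks such as template-based filtering or a molecular similarity threshold would go, and the augmented set would then be used to train the retrosynthesis model.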
