DirectMultiStep: Direct Route Generation for Multistep Retrosynthesis

Yu Shee, Anton Morgunov, Haote Li, Victor S. Batista

TL;DR

DirectMultiStep reframes retrosynthesis as direct multistep route generation using transformer-based mixture-of-experts (MoE) models, bypassing iterative single-step retrosynthesis (SSR) workflows and their exponential search. The paper demonstrates that DMS variants, especially the MoE-based DMS-Flex (Duo), achieve substantial gains in Top-K route accuracy and generalize to unseen FDA-approved drugs, while conditioning on route length and starting materials allows a smaller model to reach higher accuracy. Further analyses show how MoE architectures lower generation cost, how route length affects accuracy, and how long routes can be decomposed into shorter subroutes. The work offers a scalable, flexible path toward fully automated multistep retrosynthetic planning and releases code and data for broader adoption and benchmarking.

Abstract

Traditional computer-aided synthesis planning (CASP) methods rely on iterative single-step predictions, leading to exponential search space growth that limits efficiency and scalability. We introduce a series of transformer-based models that leverage a mixture-of-experts approach to directly generate multistep synthetic routes as a single string, conditionally predicting each transformation based on all preceding ones. Our DMS Explorer XL model, which requires only target compounds as input, outperforms state-of-the-art methods on the PaRoutes dataset with 1.9x and 3.1x improvements in Top-1 accuracy on the n$_1$ and n$_5$ test sets, respectively. Providing additional information, such as the desired number of steps and starting materials, enables both a reduction in model size and an increase in accuracy, highlighting the benefits of incorporating more constraints into the prediction process. The top-performing DMS-Flex (Duo) model scores 25-50% higher on Top-1 and Top-10 accuracies for both n$_1$ and n$_5$ sets. Additionally, our models successfully predict routes for FDA-approved drugs not included in the training data, demonstrating strong generalization capabilities. While the limited diversity of the training set may affect performance on less common reaction types, our multistep-first approach presents a promising direction towards fully automated retrosynthetic planning.
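
To make the "route as a single string" formulation and the Top-K metric concrete, here is a minimal Python sketch. It rests on assumptions that are not taken from the paper: a route is a nested dict with `smiles` and `children` keys, the string form is compact JSON with children sorted so that sibling order does not matter, and a prediction counts as correct only if its string matches the reference exactly. The function names `serialize_route` and `top_k_accuracy` are hypothetical.

```python
import json


def serialize_route(node):
    """Serialize a route tree (nested dicts) into one compact string.

    Assumed format: {"smiles": str, "children": [subtrees]}. Children are
    sorted so that two routes differing only in sibling order serialize
    to the same string.
    """
    def canonical(n):
        children = sorted(
            (canonical(c) for c in n.get("children", [])),
            key=lambda c: json.dumps(c, sort_keys=True),
        )
        out = {"smiles": n["smiles"]}
        if children:
            out["children"] = children
        return out

    return json.dumps(canonical(node), separators=(",", ":"), sort_keys=True)


def top_k_accuracy(predictions, references, k):
    """Fraction of targets whose reference route string appears among
    the first k ranked predictions (exact string match, an assumption)."""
    hits = sum(ref in preds[:k] for preds, ref in zip(predictions, references))
    return hits / len(references)


# One-step toy route: acetanilide from aniline and acetyl chloride.
route = {
    "smiles": "CC(=O)Nc1ccccc1",
    "children": [{"smiles": "Nc1ccccc1"}, {"smiles": "CC(=O)Cl"}],
}
# Same route with the two reactants listed in the opposite order.
swapped = {
    "smiles": "CC(=O)Nc1ccccc1",
    "children": [{"smiles": "CC(=O)Cl"}, {"smiles": "Nc1ccccc1"}],
}
# A different (incomplete) route for the same target.
wrong = {"smiles": "CC(=O)Nc1ccccc1", "children": [{"smiles": "Nc1ccccc1"}]}

reference = serialize_route(route)
ranked = [[serialize_route(wrong), serialize_route(swapped)]]
top1 = top_k_accuracy(ranked, [reference], k=1)
top2 = top_k_accuracy(ranked, [reference], k=2)
```

In this toy case the correct route is ranked second, so it counts as a miss at Top-1 and a hit at Top-2. Exact string matching is only meaningful if the SMILES themselves are canonicalized beforehand, which this sketch does not do.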

Paper Structure

This paper contains 22 sections, 15 figures, and 17 tables.

Figures (15)

  • Figure 1: The workflow of DirectMultiStep. (a) The SMILES representation of the target compound (blue), the starting material (red, optional), and the number of steps (optional) are tokenized, concatenated, and fed into our transformer model. The model predicts a string representation of the multistep synthesis tree. Spaces are added for clarity, and indentations indicate the levels in the synthesis route (tree). (b) Molecular structures corresponding to the target compound (blue), starting material (red, optional), and the predicted synthesis tree with structures of all molecules.
  • Figure 2: Distribution of the relative frequencies of route lengths (in terms of number of steps) in the training dataset before augmentation with permutations (163 689 routes, black), n$_1$ test set (10 000 routes, blue), and n$_5$ test set (10 000 routes, purple). Distribution is split into routes shorter (left subplot) and longer than 6 steps (right subplot).
  • Figure 3: Distribution of Top-1 and Top-10 accuracy of predictions with DMS-Flex (Duo) on test sets n$_1$ and n$_5$. There is only one route with length 10 in n$_1$, and DMS-Flex (Duo) does not predict it correctly. That route is reproduced by splitting it in half, as shown in Fig. 4.
  • Figure 4: Separation of a 10-step route from the n$_1$ test set into two 5-step routes. (a) Correct prediction from DMS-Flex (Duo) for the first half of the 10-step route with starting material information (red). (b) Correct prediction from DMS-Flex (Duo) for the second half of the 10-step route with starting material information.
  • Figure 5: Literature routes for Vonoprazan and Mitapivat correctly reproduced by DMS-Flex (Duo). Target compounds are in blue; starting materials given as inputs are in red. Ranks denote the rank of each route when the specified starting material is provided. (a) First literature route for Vonoprazan. The model predicts it correctly no matter which starting material is given. (b) Second literature route for Vonoprazan. The model predicts the route correctly only when an immediate precursor to Vonoprazan is given as the target compound. (c) First literature route for Mitapivat. (d) Second literature route for Mitapivat.
  • ...and 10 more figures