DirectMultiStep: Direct Route Generation for Multistep Retrosynthesis
Yu Shee, Anton Morgunov, Haote Li, Victor S. Batista
TL;DR
DirectMultiStep (DMS) reframes retrosynthesis as direct multistep route generation using transformer-based mixture-of-experts (MoE) models, bypassing iterative single-step retrosynthesis (SSR) workflows and their exponentially growing search space. The paper demonstrates that DMS variants, especially the MoE-based DMS-Flex (Duo), achieve substantial gains in Top-K route accuracy and generalize to unseen FDA-approved drugs, while conditioning on route length and starting materials permits smaller models and yields higher accuracy. Further analyses examine how MoE architectures affect generation cost, how route length impacts accuracy, and how long routes can be decomposed into shorter subroutes. The work offers a scalable, flexible path toward fully automated multistep retrosynthetic planning and releases code and data for broader adoption and benchmarking.
Abstract
Traditional computer-aided synthesis planning (CASP) methods rely on iterative single-step predictions, leading to exponential search space growth that limits efficiency and scalability. We introduce a series of transformer-based models that leverage a mixture-of-experts approach to directly generate multistep synthetic routes as a single string, conditionally predicting each transformation based on all preceding ones. Our DMS Explorer XL model, which requires only target compounds as input, outperforms state-of-the-art methods on the PaRoutes dataset with 1.9x and 3.1x improvements in Top-1 accuracy on the n$_1$ and n$_5$ test sets, respectively. Providing additional information, such as the desired number of steps and starting materials, enables both a reduction in model size and an increase in accuracy, highlighting the benefits of incorporating more constraints into the prediction process. The top-performing DMS-Flex (Duo) model scores 25-50% higher on Top-1 and Top-10 accuracies for both n$_1$ and n$_5$ sets. Additionally, our models successfully predict routes for FDA-approved drugs not included in the training data, demonstrating strong generalization capabilities. While the limited diversity of the training set may affect performance on less common reaction types, our multistep-first approach presents a promising direction towards fully automated retrosynthetic planning.
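To make the idea of "a route as a single string" and the Top-K metric concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes a route can be written as a nested tree whose nodes carry a `smiles` field and an optional `children` list; that layout, the helper names (`canonical`, `serialize_route`, `top_k_hit`), and the toy acetanilide route are all assumptions made for illustration. SMILES strings are compared verbatim here, whereas a real evaluation would likely canonicalize them first.

```python
import json


def serialize_route(node):
    """Serialize a route tree (nested dicts) into one compact string."""
    # Sorted keys and no whitespace, so equal trees give equal strings.
    return json.dumps(node, sort_keys=True, separators=(",", ":"))


def canonical(node):
    """Return a copy of the tree with siblings in a fixed order."""
    children = sorted(
        (canonical(child) for child in node.get("children", [])),
        key=serialize_route,
    )
    out = {"smiles": node["smiles"]}
    if children:
        out["children"] = children
    return out


def top_k_hit(predictions, reference, k):
    """True if the reference route is among the first k predicted routes."""
    ref = serialize_route(canonical(reference))
    return any(serialize_route(canonical(p)) == ref for p in predictions[:k])


# Hypothetical two-step route: nitrobenzene -> aniline -> acetanilide.
reference = {
    "smiles": "CC(=O)Nc1ccccc1",
    "children": [
        {"smiles": "Nc1ccccc1",
         "children": [{"smiles": "O=[N+]([O-])c1ccccc1"}]},
        {"smiles": "CC(=O)Cl"},
    ],
}

# Same route with the two precursors listed in the opposite order.
reordered = {
    "smiles": "CC(=O)Nc1ccccc1",
    "children": [reference["children"][1], reference["children"][0]],
}

# A different one-step route that stops at aniline.
shorter = {
    "smiles": "CC(=O)Nc1ccccc1",
    "children": [{"smiles": "Nc1ccccc1"}, {"smiles": "CC(=O)Cl"}],
}

predictions = [shorter, reordered]  # ranked model outputs, best first
print(serialize_route(canonical(reference)))
print("Top-1 hit:", top_k_hit(predictions, reference, 1))
print("Top-2 hit:", top_k_hit(predictions, reference, 2))
```

In this sketch a prediction counts as a hit only when the whole tree matches the reference, which is why the shorter route ranked first does not satisfy Top-1 while the reordered full route satisfies Top-2. Averaging `top_k_hit` over a test set would give a Top-K accuracy in the sense used above, under the stated assumptions.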
