Challenging reaction prediction models to generalize to novel chemistry
John Bradshaw, Anji Zhang, Babak Mahjour, David E. Graff, Marwin H. S. Segler, Connor W. Coley
TL;DR
The paper argues that standard benchmark splits for forward reaction prediction are in-distribution and therefore overstate the performance models achieve in real-world, out-of-distribution (OOD) use. It evaluates a SMILES-based encoder-decoder transformer trained on the Pistachio dataset under three families of OOD splits: document- and author-based splits, time-based prospective splits, and reaction-type splits based on NameRxn classes. Key findings include substantial performance drops under the OOD splits, limits on extrapolation across reaction classes, and mechanistic observations (e.g., double additions in Grignard ester reactions and stereochemistry in Heck reactions) that explain some of the extrapolation behavior. The work emphasizes that the different splits give complementary insights into how model and data interact, which can guide the development of next-generation reaction discovery tools and of benchmarks that better reflect real use cases.
Abstract
Deep learning models for anticipating the products of organic reactions have found many use cases, including validating retrosynthetic pathways and constraining synthesis-based molecular design tools. Despite compelling performance on popular benchmark tasks, strange and erroneous predictions sometimes ensue when using these models in practice. The core issue is that common benchmarks test models in an in-distribution setting, whereas many real-world uses for these models are in out-of-distribution settings and require a greater degree of extrapolation. To better understand how current reaction predictors work in out-of-distribution domains, we report a series of more challenging evaluations of a prototypical SMILES-based deep learning model. First, we illustrate how performance on randomly sampled datasets is overly optimistic compared to performance when generalizing to new patents or new authors. Second, we conduct time splits that evaluate how models perform when tested on reactions published in years after those in their training set, mimicking real-world deployment. Finally, we consider extrapolation across reaction classes to reflect what would be required for the discovery of novel reaction types. This panel of tasks can reveal the capabilities and limitations of today's reaction predictors, acting as a crucial first step in the development of tomorrow's next-generation models capable of reaction discovery.
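The three kinds of split described in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the record fields (`document`, `author`, `year`, `rxn_class`), the function names, and the toy records are all hypothetical, chosen only to show how each split keeps the test set outside the training distribution.

```python
import random
from collections import defaultdict

def group_split(records, key, test_frac=0.2, seed=0):
    """Hold out whole groups (e.g. patents or authors) so that
    no group contributes reactions to both train and test."""
    groups = defaultdict(list)
    for r in records:
        groups[r[key]].append(r)
    names = sorted(groups)
    random.Random(seed).shuffle(names)
    n_test = max(1, round(test_frac * len(names)))
    test_names = set(names[:n_test])
    train = [r for r in records if r[key] not in test_names]
    test = [r for r in records if r[key] in test_names]
    return train, test

def time_split(records, cutoff_year):
    """Train on reactions up to cutoff_year, test on later ones."""
    train = [r for r in records if r["year"] <= cutoff_year]
    test = [r for r in records if r["year"] > cutoff_year]
    return train, test

def class_split(records, held_out_classes):
    """Remove entire reaction classes from training and test on them."""
    held = set(held_out_classes)
    train = [r for r in records if r["rxn_class"] not in held]
    test = [r for r in records if r["rxn_class"] in held]
    return train, test

# Toy records with hypothetical fields; real data would carry reaction SMILES.
classes = ["class_a", "class_b", "class_c"]
records = [
    {
        "id": i,
        "document": f"doc{i % 5}",
        "author": f"author{i % 3}",
        "year": 2010 + i % 10,
        "rxn_class": classes[i % 3],
    }
    for i in range(30)
]

doc_train, doc_test = group_split(records, "document")
time_train, time_test = time_split(records, cutoff_year=2015)
cls_train, cls_test = class_split(records, ["class_a"])
```

A plain random split would instead shuffle individual reactions, so reactions from the same patent, author, year, or class would typically land on both sides. That overlap is what makes the standard benchmark setting in-distribution.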
