Challenging reaction prediction models to generalize to novel chemistry

John Bradshaw, Anji Zhang, Babak Mahjour, David E. Graff, Marwin H. S. Segler, Connor W. Coley

TL;DR

The paper argues that standard benchmark splits for forward reaction prediction are inherently in-distribution and thus overstate performance when models are deployed in real-world, out-of-distribution (OOD) scenarios. It introduces a set of OOD evaluation strategies (document- and author-based splits, time-based prospective splits, and reaction-type NameRxn splits) applied to a SMILES-based encoder-decoder transformer trained on the Pistachio dataset. Key findings include substantial performance drops under OOD splits, observable extrapolation limits across reaction classes, and mechanistic insights (e.g., double additions in the Grignard Ester split and stereochemistry in the Heck split) that explain some extrapolation behaviors. The work emphasizes that combining multiple splits provides complementary insights into how model and data interact, guiding the development of next-generation reaction discovery tools and of benchmarks that more faithfully reflect real-world use cases.

Abstract

Deep learning models for anticipating the products of organic reactions have found many use cases, including validating retrosynthetic pathways and constraining synthesis-based molecular design tools. Despite compelling performance on popular benchmark tasks, strange and erroneous predictions sometimes ensue when using these models in practice. The core issue is that common benchmarks test models in an in-distribution setting, whereas many real-world uses for these models are in out-of-distribution settings and require a greater degree of extrapolation. To better understand how current reaction predictors work in out-of-distribution domains, we report a series of more challenging evaluations of a prototypical SMILES-based deep learning model. First, we illustrate how performance on randomly sampled datasets is overly optimistic compared to performance when generalizing to new patents or new authors. Second, we conduct time splits that evaluate how models perform when tested on reactions published in years after those in their training set, mimicking real-world deployment. Finally, we consider extrapolation across reaction classes to reflect what would be required for the discovery of novel reaction types. This panel of tasks can reveal the capabilities and limitations of today's reaction predictors, acting as a crucial first step in the development of tomorrow's next-generation models capable of reaction discovery.
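
The document-based and time-based splits described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' code: the field names (`smiles`, `document_id`, `year`), the function names, and the toy reactions are all assumptions made for illustration. Author-based splits are not covered here, since they additionally require grouping documents that share authors.

```python
import random
from collections import defaultdict

# Toy records; the SMILES strings, document IDs, and years are made-up placeholders.
reactions = [
    {"smiles": "CCO>>CC=O", "document_id": "doc-A", "year": 1995},
    {"smiles": "CCCO>>CCC=O", "document_id": "doc-A", "year": 1995},
    {"smiles": "CC(=O)O.CO>>CC(=O)OC", "document_id": "doc-B", "year": 1998},
    {"smiles": "CC(=O)O.CCO>>CC(=O)OCC", "document_id": "doc-B", "year": 1998},
    {"smiles": "CCBr.[OH-]>>CCO", "document_id": "doc-C", "year": 2003},
]


def document_split(records, test_fraction=0.2, seed=0):
    """Hold out whole documents, so no document contributes to both sets."""
    by_doc = defaultdict(list)
    for rxn in records:
        by_doc[rxn["document_id"]].append(rxn)
    doc_ids = sorted(by_doc)  # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(doc_ids)
    n_test = max(1, round(test_fraction * len(doc_ids)))
    test_docs = set(doc_ids[:n_test])
    train = [r for d in doc_ids if d not in test_docs for r in by_doc[d]]
    test = [r for d in doc_ids if d in test_docs for r in by_doc[d]]
    return train, test


def time_split(records, cutoff_year):
    """Train on reactions reported up to the cutoff (inclusive); test on later ones."""
    train = [r for r in records if r["year"] <= cutoff_year]
    test = [r for r in records if r["year"] > cutoff_year]
    return train, test


doc_train, doc_test = document_split(reactions, test_fraction=0.34)
time_train, time_test = time_split(reactions, cutoff_year=1996)
```

A purely random split would shuffle individual reactions instead, so near-duplicate reactions from the same document could land on both sides of the split. That leakage is what grouping by document is meant to remove.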

Paper Structure (33 sections, 1 equation, 12 figures, 4 tables)

Figures (12)

  • Figure 1: (a) Reaction prediction, in the context of this manuscript, is the task of predicting the major product(s) of a reaction given the reactants. (Note that by "reaction" we mean specific reported reaction examples, rather than generic reaction "types" or "classes" that cover a large group of related specific examples; we return to the concept of reaction types in the section on NameRxn splits.) Chemical reaction datasets are often curated from academic or patent literature, so each reaction is associated with hidden metadata that the predictive model does not see, such as the reaction's patent document, its authors, its publication date, and the assignee/organization that filed the patent. (b) Typically, reaction predictors are assessed in an in-distribution setting, meaning that the training and test reactions come from the same distribution. However, in the real world, reaction prediction models are often deployed on out-of-distribution data, a setup that we discuss how to replicate.
  • Figure 2: (a) Reaction datasets are formed by authors coming together and writing documents, which contain many (often similar) reactions. Evaluating a reaction predictor on train/test sets that account for this structure yields different accuracy scores. (Note that in this paper we clean and deduplicate reactions before creating the splits, such that each reaction is associated with only one document; see the dataset section.) (b) Top-1, top-3, and top-5 accuracies for reaction-based (i.e., random), document-based, and author-based splits.
  • Figure 3: (a) Top-1 accuracies of reaction predictors trained up to different time cutoffs (different colors) when evaluated on held-out test sets for each year (x-axis). For instance, the line in the lightest shade, marked "1996", reports the top-1 accuracy for a reaction predictor trained on reactions that were reported up to 1996 (inclusive). The dashed line indicates model performance when the model is "extrapolating", meaning that the test set year is beyond the model's time cutoff. Note that we control for training set size so that each model sees the same number of reactions in training (the absolute performance of the model is therefore lower than when training on all available data up to a given year). Further details on the experimental setup and additional results can be found in the sections on time-based split methods and additional time-split results. (b) Performance of the models trained on different time splits on a separate, static test set of Buchwald–Hartwig reactions. The blue solid line shows the top-1 accuracies (left-hand axis), while the dotted gray line shows the number of Buchwald–Hartwig reactions in the models' training sets (right-hand axis).
  • Figure 4: Top-1, top-3, and top-5 accuracies for reaction predictors evaluated on different reaction-type splits. Each column shows the accuracy on a held-out set of a particular reaction class both (a) when seeing 1000 separate examples of the same reaction type during training (gray circles; intrinsic difficulty) and (b) when seeing no reactions of that type during training (blue arrows; extrapolation difficulty). The gray dashed horizontal line shows the accuracy of a reaction predictor evaluated on an in-distribution test set (i.e., containing many different reaction classes). Note that we remove all uncategorized reactions (NameRxn class "0.0") when creating our datasets.
  • Figure 5: We investigate reasons for the contrasting performance in the different NameRxn splits. (a) For the Chloro Suzuki and Triflyloxy Suzuki splits, we assess whether the large number of other Suzuki reactions present can explain the good extrapolative performance. Namely, to evaluate on our specific Chloro and Triflyloxy Suzuki test sets, we create three different training sets (as shown by the cartoon, left): (i) reactions from all classes (including separate reactions from the same specific Suzuki class); (ii) reactions only from other specific Suzuki reaction classes (and non-Suzuki reactions); and (iii) non-Suzuki reactions only. Results are shown on the right. The different bars show the accuracy for the different cases (the square colored boxes in the x-axis labels indicate the reaction classes used in training the respective models). (b) For the Grignard Ester split, we notice that the model does particularly poorly on a double-addition subset. Such a reaction can, however, be expressed as two single additions, and performance improves when we allow the model two rounds of prediction (i.e., when we feed the predicted product from the first round back in as an input for the second round). (c) For the Heck split, we see that the model particularly struggles with the stereochemistry and regioselectivity present in these reactions (see text for further details).
  • ...and 7 more figures
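
The top-1, top-3, and top-5 accuracies reported in the captions above can be computed along the lines of the following Python sketch. It is an illustration under stated assumptions, not the paper's evaluation code: the function name and the toy predictions are hypothetical, and products are compared by exact string match, whereas a real evaluation would typically canonicalize SMILES before comparing.

```python
def top_k_accuracy(ranked_predictions, targets, k):
    """Fraction of reactions whose true product appears among the first k predictions.

    ranked_predictions: one list of candidate product SMILES per reaction, best first.
    targets: the recorded product SMILES for each reaction.
    Assumption: strings are already in a comparable (e.g., canonical) form.
    """
    if len(ranked_predictions) != len(targets):
        raise ValueError("need exactly one prediction list per target")
    hits = sum(
        target in preds[:k]
        for preds, target in zip(ranked_predictions, targets)
    )
    return hits / len(targets)


# Toy example with made-up candidate lists for three reactions.
toy_predictions = [
    ["CCO", "CCN", "CCC"],   # correct product ranked first
    ["c1ccccc1", "CC"],      # correct product ranked second
    ["CO"],                  # correct product never predicted
]
toy_targets = ["CCO", "CC", "CN"]

top1 = top_k_accuracy(toy_predictions, toy_targets, k=1)
top3 = top_k_accuracy(toy_predictions, toy_targets, k=3)
```

In this toy case one of three reactions is a top-1 hit and two of three are top-3 hits, which shows why top-k accuracy can only stay the same or increase as k grows.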