log-RRIM: Yield Prediction via Local-to-global Reaction Representation Learning and Interaction Modeling

Xiao Hu, Ziqi Chen, Bo Peng, Daniel Adu-Ampratwum, Xia Ning

TL;DR

The paper addresses the challenge of accurately predicting chemical reaction yields across diverse reaction types. It introduces log-RRIM, a local-to-global graph transformer that explicitly models reagent–reaction center interactions through cross-attention. The model first learns molecule-level representations (MRL), then models interactions between molecules (MIT), and finally aggregates them (RIA) to predict the yield. In the reported experiments, log-RRIM, including a pretraining-free variant, achieves superior or competitive performance on the USPTO500MT, CJHIF, and Buchwald–Hartwig datasets, with notable gains on medium- to high-yield reactions and improved sensitivity to small fragment changes. The work underscores the value of task-specific architectural design and interactive molecular modeling for practical reaction planning. It also notes current limitations, such as the need for clearly defined reaction centers, and points to broader chemical knowledge integration and multi-task learning as future directions.

Abstract

Accurate prediction of chemical reaction yields is crucial for optimizing organic synthesis, potentially reducing time and resources spent on experimentation. With the rise of artificial intelligence (AI), there is growing interest in leveraging AI-based methods to accelerate yield predictions without conducting in vitro experiments. We present log-RRIM, an innovative graph transformer-based framework designed for predicting chemical reaction yields. A key feature of log-RRIM is its integration of a cross-attention mechanism that focuses on the interplay between reagents and reaction centers. This design reflects a fundamental principle in chemical reactions: the crucial role of reagents in influencing bond-breaking and bond-formation processes, which ultimately affect reaction yields. log-RRIM also implements a local-to-global reaction representation learning strategy. This approach first captures detailed molecule-level information and then models and aggregates intermolecular interactions. Through this hierarchical process, log-RRIM effectively captures how different molecular fragments contribute to and influence the overall reaction yield, regardless of their size. log-RRIM shows superior performance in our experiments, especially for medium- to high-yielding reactions, supporting its reliability as a predictor. The framework's modeling of reactant-reagent interactions and its capture of molecular fragment contributions make it a valuable tool for reaction planning and optimization in chemical synthesis. The data and code of log-RRIM are accessible through https://github.com/ninglab/Yield_log_RRIM.
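To make the cross-attention idea in the abstract concrete, here is a minimal sketch in which reaction-center atom embeddings act as queries over reagent atom embeddings. It is an illustration under stated assumptions, not the authors' implementation: the function name `cross_attention`, the embedding size, the random inputs, and the single-head scaled dot-product form are all hypothetical choices made for this example.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def cross_attention(center_atoms, reagent_atoms, w_q, w_k, w_v):
    """Single-head scaled dot-product cross-attention (illustrative only).

    center_atoms:  (n_center, d) embeddings of reaction-center atoms (queries)
    reagent_atoms: (n_reagent, d) embeddings of reagent atoms (keys and values)
    """
    q = center_atoms @ w_q                      # (n_center, d)
    k = reagent_atoms @ w_k                     # (n_reagent, d)
    v = reagent_atoms @ w_v                     # (n_reagent, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])     # (n_center, n_reagent)
    weights = softmax(scores, axis=-1)          # each row sums to 1
    return weights @ v, weights


# Hypothetical toy inputs: 3 reaction-center atoms, 5 reagent atoms, d = 8.
rng = np.random.default_rng(0)
d = 8
center_atoms = rng.normal(size=(3, d))
reagent_atoms = rng.normal(size=(5, d))
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))

updated, weights = cross_attention(center_atoms, reagent_atoms, w_q, w_k, w_v)
print(updated.shape, weights.shape)
```

Each row of `weights` describes how strongly one reaction-center atom attends to each reagent atom, which is the kind of reagent-to-center influence the abstract describes. The actual model may use multiple heads, learned projections, and a different direction of attention.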

Paper Structure

This paper contains 39 sections, 12 equations, 9 figures, 14 tables.

Figures (9)

  • Figure 1: Overview of the reaction yield distributions of the two datasets
  • Figure 2: Pipeline of log-RRIM
  • Figure 3: Performance comparison of log-RRIM and T5Chem across yield ranges on the USPTO500MT testing set. Left y-axis: MAE of predicted yields. Right y-axis: percentage of reactions in the testing set for each yield range. Significance is tested at the 5% level: * for $p<0.05$, ** for $p<0.005$, *** for $p<0.0005$.
  • Figure 4: Model performance on reaction pairs categorized by similarity. The left y-axis displays the number of reaction pairs on a logarithmic scale. Grey bars indicate the number of reaction pairs within each similarity range. Green bars represent the number of reaction pairs where log-RRIM$_b$ predicts more accurately than T5Chem. The right y-axis shows the percentage of reaction pairs with more accurate predictions by log-RRIM$_b$ relative to the total number of reactions in each similarity range, as depicted by the red line.
  • Figure 5: Case analysis on the USPTO500MT dataset. Each reaction is reported with its reactants, reagents, and products, together with the ground-truth yield and the yields predicted by T5Chem and log-RRIM$_b$. $AE$ in parentheses is the absolute error between the predicted and ground-truth yields. $\Delta$ in parentheses is the change in the ground-truth or predicted yield of the second reaction relative to the corresponding value in the first reaction.
  • ...and 4 more figures
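
The per-range comparison described in the Figure 3 caption (MAE on the left axis, share of test reactions on the right axis) can be sketched as follows. This is only an illustration: the function name `mae_by_yield_range`, the bin edges, and the toy yields are assumptions made for the example, not values from the paper.

```python
def mae_by_yield_range(y_true, y_pred, edges):
    """Return (range, MAE, percent of reactions) for each yield range.

    Ranges are half-open [lo, hi), except the last one, which also
    includes its upper edge so that a 100% yield is counted.
    """
    rows = []
    total = len(y_true)
    for lo, hi in zip(edges[:-1], edges[1:]):
        is_last = hi == edges[-1]
        errors = [
            abs(t - p)
            for t, p in zip(y_true, y_pred)
            if lo <= t < hi or (is_last and t == hi)
        ]
        if not errors:
            continue  # skip empty ranges rather than divide by zero
        mae = sum(errors) / len(errors)
        share = 100.0 * len(errors) / total
        rows.append(((lo, hi), mae, share))
    return rows


# Hypothetical ground-truth and predicted yields, in percent.
y_true = [10, 40, 60, 90, 100]
y_pred = [20, 30, 65, 85, 95]
rows = mae_by_yield_range(y_true, y_pred, edges=[0, 50, 100])
for (lo, hi), mae, share in rows:
    print(f"{lo}-{hi}%: MAE={mae:.1f}, share={share:.1f}%")
```

Reporting the share of reactions next to the MAE matters because a low error in a sparsely populated yield range says little about overall reliability.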