
Are we making much progress? Revisiting chemical reaction yield prediction from an imbalanced regression perspective

Yihong Ma, Xiaobao Huang, Bozhao Nan, Nuno Moniz, Xiangliang Zhang, Olaf Wiest, Nitesh V. Chawla

TL;DR

This paper addresses chemical reaction yield prediction, emphasizing that high-yield reactions are underrepresented in real-world data. It reframes yield prediction as an imbalanced regression problem and shows that simple cost-sensitive re-weighting methods can substantially improve predictive performance on high-yield reactions with modest or negligible impact on overall metrics. Through extensive experiments on three real-world datasets and multiple models, the authors demonstrate that focal loss and label distribution smoothing deliver large gains in the few-shot (high-yield) regions, addressing a critical gap for synthesis planning. The findings suggest a practical path forward for yield prediction research and highlight the importance of accounting for data imbalance in regression tasks in chemistry and related domains.

Abstract

The yield of a chemical reaction quantifies the percentage of the target product formed in relation to the reactants consumed during the chemical reaction. Accurate yield prediction can guide chemists toward selecting high-yield reactions during synthesis planning, offering valuable insights before dedicating time and resources to wet lab experiments. While recent advancements in yield prediction have led to overall performance improvement across the entire yield range, an open challenge remains in enhancing predictions for high-yield reactions, which are of greater concern to chemists. In this paper, we argue that the performance gap in high-yield predictions results from the imbalanced distribution of real-world data skewed towards low-yield reactions, often due to unreacted starting materials and inherent ambiguities in the reaction processes. Despite this data imbalance, existing yield prediction methods continue to treat different yield ranges equally, assuming a balanced training distribution. Through extensive experiments on three real-world yield prediction datasets, we emphasize the urgent need to reframe reaction yield prediction as an imbalanced regression problem. Finally, we demonstrate that incorporating simple cost-sensitive re-weighting methods can significantly enhance the performance of yield prediction models on underrepresented high-yield regions.
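To make the idea of cost-sensitive re-weighting concrete, the following is a minimal sketch, not the authors' implementation. It assumes yields are percentages in [0, 100], bins them, smooths the bin counts with a small Gaussian kernel (in the spirit of label distribution smoothing), and weights each sample by the inverse of its smoothed bin density. The function names `lds_weights` and `weighted_mse`, the bin count, and the kernel width are all illustrative assumptions.

```python
import numpy as np


def lds_weights(yields, n_bins=10, sigma=1.0):
    """Per-sample weights from the inverse of a Gaussian-smoothed label histogram.

    Hypothetical helper: n_bins and sigma are illustrative defaults.
    """
    yields = np.asarray(yields, dtype=float)
    # Histogram of yields over the assumed range of 0 to 100 percent.
    counts, edges = np.histogram(yields, bins=n_bins, range=(0.0, 100.0))
    # Small Gaussian kernel spanning two bins on either side.
    offsets = np.arange(-2, 3)
    kernel = np.exp(-0.5 * (offsets / sigma) ** 2)
    kernel /= kernel.sum()
    # Smooth the counts so neighbouring bins share density.
    smoothed = np.convolve(counts, kernel, mode="same")
    # Map each sample to its bin using the interior edges; 100 lands in the last bin.
    bin_idx = np.clip(np.digitize(yields, edges[1:-1]), 0, n_bins - 1)
    # Rare (typically high-yield) bins receive larger weights.
    weights = 1.0 / np.maximum(smoothed[bin_idx], 1e-8)
    # Normalise so the average weight is 1 and the loss scale stays comparable.
    return weights / weights.mean()


def weighted_mse(y_true, y_pred, weights):
    """Mean squared error with per-sample weights."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(weights * (y_true - y_pred) ** 2))


if __name__ == "__main__":
    # Synthetic, low-yield-skewed labels for illustration only.
    y = np.array([5.0] * 50 + [15.0] * 30 + [55.0] * 15 + [95.0] * 5)
    w = lds_weights(y)
    pred = np.full_like(y, y.mean())  # a constant predictor as a toy baseline
    print("plain MSE:   ", weighted_mse(y, pred, np.ones_like(y)))
    print("weighted MSE:", weighted_mse(y, pred, w))
```

With weights like these, errors on the sparse high-yield samples contribute more to the training objective than errors on the abundant low-yield samples, which is the general mechanism the abstract refers to. The specific weighting schemes evaluated in the paper may differ.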

Paper Structure (15 sections, 3 equations, 1 figure, 2 tables)

Figures (1)

  • Figure 1: A comparison between yield distributions (left) and test error distributions (right) on three real-world datasets.

Theorems & Definitions (2)

  • Definition 1: Reaction yield prediction
  • Definition 2: Imbalanced regression