Revealing the Relationship Between Publication Bias and Chemical Reactivity with Contrastive Learning

Wenhao Gao; Priyanka Raghavan; Ron Shprints; Connor W. Coley

TL;DR

This work examines publication bias in substrate-scope tables by introducing ContraScope, a substrate-scope contrastive learning framework that learns aryl-halide embeddings from literature-derived substrate groupings. Using a triplet loss on a graph neural network, ContraScope pulls together substrates that appear in the same scope table, with distances modulated by yield differences, and produces embeddings that correlate with local reactivity descriptors calculated by DFT. The embeddings support qualitative and quantitative insights into reactivity and enable downstream tasks such as yield and regioselectivity prediction, suggesting that publication trends encode meaningful chemical information. The approach offers a new way to use literature data for molecular representation learning and could be extended to other substrate classes, although the authors acknowledge limitations and room for methodological refinement.

Abstract

A synthetic method's substrate tolerance and generality are often showcased in a "substrate scope" table. However, substrate selection exhibits a frequently discussed publication bias: unsuccessful experiments or low-yielding results are rarely reported. In this work, we explore more deeply the relationship between such publication bias and chemical reactivity beyond the simple analysis of yield distributions using a novel neural network training strategy, substrate scope contrastive learning. By treating reported substrates as positive samples and non-reported substrates as negative samples, our contrastive learning strategy teaches a model to group molecules within a numerical embedding space, based on historical trends in published substrate scope tables. Training on 20,798 aryl halides in the CAS Content Collection$^{\text{TM}}$, spanning thousands of publications from 2010-2015, we demonstrate that the learned embeddings exhibit a correlation with physical organic reactivity descriptors through both intuitive visualizations and quantitative regression analyses. Additionally, these embeddings are applicable to various reaction modeling tasks like yield prediction and regioselectivity prediction, underscoring the potential to use historical reaction data as a pre-training task. This work not only presents a chemistry-specific machine learning training strategy to learn from literature data in a new way, but also represents a unique approach to uncover trends in chemical reactivity reflected by trends in substrate selection in publications.
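The contrastive objective described above can be illustrated with a small triplet-loss sketch. This is not the paper's formulation: the function names, the Euclidean distance, the margin value, and in particular the way yield differences enter (as a weight `1 - |Δyield|` on the same-scope distance, with yields scaled to [0, 1]) are assumptions chosen only to make the idea concrete.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two embedding vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def yield_weighted_triplet_loss(anchor, positive, negative,
                                yield_anchor, yield_positive, margin=1.0):
    """Hypothetical triplet loss for substrate-scope contrastive learning.

    anchor, positive: embeddings of two substrates from the same scope table.
    negative: embedding of a substrate from a different scope table.
    yield_anchor, yield_positive: reported yields scaled to [0, 1].

    Assumption (not from the paper): the pull between same-scope substrates
    is weighted by 1 - |yield difference|, so pairs with similar yields are
    pulled together more strongly than pairs with very different yields.
    """
    weight = 1.0 - abs(yield_anchor - yield_positive)
    d_pos = euclidean(anchor, positive)
    d_neg = euclidean(anchor, negative)
    return max(0.0, weight * d_pos - d_neg + margin)


# Toy 2-D embeddings: the negative sits closer to the anchor than the
# positive does, so the loss is positive and training would push it away.
loss_near_negative = yield_weighted_triplet_loss(
    [0.0, 0.0], [3.0, 4.0], [0.0, 1.0], 0.75, 0.25)
print(loss_near_negative)  # 2.5

# Once the negative is far enough away, the triplet contributes no loss.
loss_far_negative = yield_weighted_triplet_loss(
    [0.0, 0.0], [3.0, 4.0], [10.0, 0.0], 0.75, 0.25)
print(loss_far_negative)  # 0.0
```

In an actual training setup the embeddings would come from the graph neural network rather than being fixed vectors, and the loss would be averaged over many sampled triplets.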

Paper Structure (22 sections, 3 equations, 29 figures)

Introduction
Substrate Scope Contrastive Learning
Results
Visualization and intuitive investigation of the learned aryl halide embeddings
Regression analysis reveals the relationship between the learned embeddings and reactivity descriptors
Potential downstream application of the learned embeddings
Discussion
Methods
Supporting Information
Code and Data Availability
The misalignment of current pre-training methods and functionalities
Statistics of the dataset
Learning curves and hyper-parameter tuning
Details of reactivity descriptors
Additional Results
...and 7 more sections

Figures (29)

Figure 1: (A) Substrate scope tables for training the network are curated from the CAS Content Collection, focusing on aryl halides with their associated yields. The two substrate scopes $i$ and $j$ shown in the figure are real samples from the database [guastavino2014room, movahed2014one]. (B) Overview of substrate scope contrastive learning: The hypothesis behind our training strategy is that publication bias in chemical reactions reveals more subtle reactivity trends than the pronounced inclination towards higher yields depicted in the histogram. A message-passing neural network operates on molecular graphs to derive atomic embeddings. Embeddings of substrates from the same scope table are pulled together, while embeddings of substrates from distinct scope tables are forced apart. The training aims to provide atomic embeddings that inherently reflect the grouping of reported substrate scope tables, yielding a vector space that is shown to align with reactivity trends. (C) The embeddings obtained after contrastive learning can then be analyzed to understand what has been learned and/or used as representations of molecules in various reactivity-related tasks.
Figure 2: (A) A two-dimensional projection of learned embeddings using t-SNE [hinton2002stochastic] for intuitive investigation of aryl halide embeddings obtained through substrate scope contrastive learning; each point denotes an aryl halide and is colored by halide type. Neighborhoods of similar aryl halides in the embedding space are annotated with structures. (B) t-SNE visualizations of the learned embeddings, colored by traditional reactivity descriptor values.
Figure 3: Relationship between the learned embeddings and conventional reactivity descriptors. (A and B) Regression performance ($r^2$) when learning to predict local (A) and global (B) physical organic chemistry descriptors from different input representations using linear models; negative values are excluded from the plot. (C) Analysis of support vector machine (SVM) [hearst1998support] regression performance as a function of dataset size, highlighting the efficacy of our embeddings in scenarios with limited data.
Figure 4: Validation and application of learned embeddings in various downstream tasks. (A) Application of ContraScope embeddings to predict the yield of aryl bromides in cross-coupling reactions [kariofillis2022using]. We show its predictive performance via leave-one-out validation compared with other common featurizations and a scatter plot mapping our predictions against experimental yields. (B) Application of ContraScope embeddings to predict the regioselectivity in arylation reactions of fluorobenzenes, with the model trained on penta- and trifluoronitrobenzenes to anticipate reactivity in tetrafluoronitrobenzenes, highlighting the reaction centers confirmed by experiments and annotated with prediction outcomes [cargill2010palladium].
Figure S1: Illustration of various substituted benzonitriles. Despite the structural similarity of the benzonitrile backbone across all molecules, the diverse electronic properties of the substituents significantly influence the molecular functionality. This exemplifies the limitation of some pre-training methods in graph neural networks that may not distinguish between these functional nuances. The corresponding SMILES notations are provided below each molecule, indicating that the highlighted substituents are also interchangeable in string representation, which also brings into question the reasonableness of string-based pre-training methodologies.
...and 24 more figures
