
Contextual Molecule Representation Learning from Chemical Reaction Knowledge

Han Tang, Shikun Feng, Bicheng Lin, Yuyan Ni, Jingjing Liu, Wei-Ying Ma, Yanyan Lan

TL;DR

This work introduces REMO, a self-supervised molecular representation learning framework that leverages chemical reaction context to mitigate the limitations of traditional single-molecule masked modeling. By pre-training encoders on 1.72 million reactions with two objectives—Masked Reaction Centre Reconstruction and Reaction Centre Identification—REMO learns contextualized substructure semantics aligned with reaction centres. Across activity cliffs, drug-drug interactions, and reaction-type classification, REMO achieves state-of-the-art performance and even surpasses fingerprint-based baselines on activity-cliff benchmarks, demonstrating the practical value of reaction-aware pre-training. The results show that integrating reaction knowledge into self-supervised learning yields more transferable and chemically meaningful representations with reduced reliance on large labeled datasets.

Abstract

In recent years, self-supervised learning has emerged as a powerful tool to harness abundant unlabelled data for representation learning and has been broadly adopted in diverse areas. However, when applied to molecular representation learning (MRL), prevailing techniques such as masked sub-unit reconstruction often fall short: atoms within molecules can be combined with a high degree of freedom, which brings insurmountable complexity to the masking-reconstruction paradigm. To tackle this challenge, we introduce REMO, a self-supervised learning framework that takes advantage of well-defined atom-combination rules in common chemistry. Specifically, REMO pre-trains graph/Transformer encoders on 1.7 million known chemical reactions in the literature. We propose two pre-training objectives: Masked Reaction Centre Reconstruction (MRCR) and Reaction Centre Identification (RCI). REMO offers a novel solution to MRL by exploiting the underlying shared patterns in chemical reactions as context for pre-training, which effectively infers meaningful representations of common chemistry knowledge. Such contextual representations can then support diverse downstream molecular tasks, such as affinity prediction and drug-drug interaction prediction, with minimal fine-tuning. Extensive experimental results on MoleculeACE, ACNet, drug-drug interaction (DDI), and reaction type classification show that across all tested downstream tasks, REMO outperforms the standard baseline of single-molecule masked modeling used in current MRL. Remarkably, REMO is the first deep learning model to surpass fingerprint-based methods on activity cliff benchmarks.
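To make the two objectives concrete, the following is a minimal sketch of how training examples for MRCR and RCI could be assembled. It is a toy under stated assumptions, not the authors' implementation: molecules are reduced to flat lists of atom symbols rather than graphs, and the names `MASK`, `MaskedExample`, and `build_example` are hypothetical.

```python
from dataclasses import dataclass

MASK = "[MASK]"  # hypothetical mask token, not taken from the paper


@dataclass
class MaskedExample:
    masked_atoms: list    # primary reactant with reaction-centre atoms hidden
    context_atoms: list   # the other reactants, left unmasked (the "context")
    targets: dict         # position -> original atom symbol (MRCR target)
    centre_labels: list   # 1 if the atom is in a reaction centre (RCI target)


def build_example(primary, centres, others):
    """Build toy MRCR/RCI targets for one reaction.

    primary: atom symbols of the primary reactant
    centres: indices of reaction-centre atoms in `primary`
    others:  atom-symbol lists of the remaining reactants
    """
    centre_set = set(centres)
    # MRCR: hide the reaction-centre atoms; the model must reconstruct them
    # given the rest of the molecule and the other reactants.
    masked = [MASK if i in centre_set else atom for i, atom in enumerate(primary)]
    targets = {i: primary[i] for i in sorted(centre_set)}
    # RCI: a per-atom binary label marking reaction-centre membership.
    labels = [1 if i in centre_set else 0 for i in range(len(primary))]
    return MaskedExample(masked, [list(o) for o in others], targets, labels)


# Toy usage: heavy atoms of ethanol, with the oxygen assumed to be the centre,
# and the heavy atoms of acetic acid as the other reactant.
example = build_example(["C", "C", "O"], [2], [["C", "C", "O", "O"]])
print(example.masked_atoms)   # ['C', 'C', '[MASK]']
print(example.centre_labels)  # [0, 0, 1]
```

The design point the sketch illustrates is that the masked positions are chosen by the reaction rather than at random, and that the other reactants travel with the example as conditioning context.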

Paper Structure (29 sections, 6 equations, 15 figures, 7 tables)

This paper contains 29 sections, 6 equations, 15 figures, 7 tables.

Figures (15)

  • Figure 1: a: Adding either carbon or oxygen to the left molecule yields a valid molecule in both cases, but the resulting molecules undergo different reactions and therefore have different properties. b: An activity cliff pair on the target Thrombin (F2), where $\mathrm{K}_\mathrm{i}$ measures the equilibrium binding affinity of a ligand that reduces the activity of its binding partner.
  • Figure 2: a: Traditional masked language model on a single molecule with objective $P(z|\mathcal{Z})$, where $z$ is the masked substructure and $\mathcal{Z}$ stands for the remaining ones. b: Our proposed Masked Reaction Centre Reconstruction with objective $P(z|\mathcal{Z},\mathcal{R})$, where $\mathcal{R}$ denotes the other reactants in the chemical reaction.
  • Figure 3: Architecture of REMO. a: An example chemical reaction consisting of two reactants and one product, where the primary reactant has two reaction centres (marked in red). b: Illustrations of two pre-training tasks in REMO, i.e., Masked Reaction Centre Reconstruction and Reaction Centre Identification, where $,$ refer to feature concatenation
  • Figure 4: The information entropy of reconstructing reaction centres from reactions
  • Figure 5: RMSE of $\mathrm{REMO_M}$-Graphormer, $\mathrm{REMO_I}$-Graphormer and ECFP+SVM on MoleculeACE
  • ...and 10 more figures
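
The two objectives named in the Figure 2 caption can be written side by side. The sketch below assumes a negative log-likelihood form for each loss; the caption gives only the conditional probabilities, so the loss names $\mathcal{L}_{\text{single}}$ and $\mathcal{L}_{\text{MRCR}}$ are illustrative. The final line is the standard information-theoretic fact that conditioning on additional variables cannot increase entropy, which is one way to read why reaction context should make reconstruction easier.

```latex
\begin{align}
  % Single-molecule masked modelling (Figure 2a)
  \mathcal{L}_{\text{single}} &= -\log P(z \mid \mathcal{Z}) \\
  % Masked Reaction Centre Reconstruction (Figure 2b)
  \mathcal{L}_{\text{MRCR}} &= -\log P(z \mid \mathcal{Z}, \mathcal{R}) \\
  % Conditioning on the other reactants cannot increase uncertainty about z
  H(z \mid \mathcal{Z}, \mathcal{R}) &\le H(z \mid \mathcal{Z})
\end{align}
```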