ReactEmbed: A Cross-Domain Framework for Protein-Molecule Representation Learning via Biochemical Reaction Networks

Amitay Sicherman, Kira Radinsky

TL;DR

ReactEmbed tackles the limitation of unimodal protein and molecule representations by constructing a weighted biochemical reaction graph and learning a cross-domain, unified embedding space. Through projection-aware P2U/M2U transformations and a balanced triplet loss with dual negative sampling, it enables zero-shot cross-domain predictions and strong performance across molecular, protein, and interaction tasks. The framework demonstrates state-of-the-art results on 11 benchmarks and proves its practical utility by successfully predicting BBB permeability for protein–lipid nanoparticle complexes, guiding experimental validation such as transferrin-mediated brain delivery. Ablation studies reveal robustness to data quality and input reductions, while real-world deployment underscores the method’s potential for accelerating therapeutic design and targeted delivery. Overall, ReactEmbed provides a versatile, cross-domain representation learning approach that integrates biochemical reaction context to enrich protein and molecular embeddings and enable transferable predictions.

Abstract

The challenge in computational biology and drug discovery lies in creating comprehensive representations of proteins and molecules that capture their intrinsic properties and interactions. Traditional methods often focus on unimodal data, such as protein sequences or molecular structures, limiting their ability to capture complex biochemical relationships. This work enhances these representations by integrating biochemical reactions encompassing interactions between molecules and proteins. By leveraging reaction data alongside pre-trained embeddings from state-of-the-art protein and molecule models, we develop ReactEmbed, a novel method that creates a unified embedding space through contrastive learning. We evaluate ReactEmbed across diverse tasks, including drug-target interaction, protein-protein interaction, protein property prediction, and molecular property prediction, consistently surpassing all current state-of-the-art models. Notably, we showcase ReactEmbed's practical utility through successful implementation in lipid nanoparticle-based drug delivery, enabling zero-shot prediction of blood-brain barrier permeability for protein-nanoparticle complexes. The code and comprehensive database of reaction pairs are available for open use on GitHub: https://github.com/amitaysicherman/ReactEmbed

Paper Structure

This paper contains 29 sections, 8 equations, 3 figures, 3 tables, 1 algorithm.

Figures (3)

  • Figure 1: Overview of the ReactEmbed framework. Left: Example conversion of a toy reaction dataset containing four biochemical reactions into a weighted reaction graph, where edge weights represent the co-occurrence frequency of entities. Middle: The ReactEmbed model architecture shows how domain-specific pre-trained embeddings are projected into a unified space using P2U (Protein to Unified) and M2U (Molecule to Unified) transformations. Right: Illustration of triplet generation for contrastive learning, where for a given protein anchor, we show both intra-domain negative sampling (another protein) and cross-domain negative sampling (a molecule), with the loss function working to minimize the distance to positive examples while maximizing distance to negatives.
  • Figure 2: Zero-shot cross-domain prediction framework for blood-brain barrier permeability (BBBP). Top: Traditional BBBP dataset containing molecular data and their BBB penetration labels. Center: ReactEmbed converts molecules into a unified protein-molecule embedding space, where a classification model is trained on the molecular BBBP data. Bottom: Zero-shot prediction. Given a new protein, ReactEmbed projects it into the unified space, where the trained classifier can make BBB permeability predictions without requiring protein-specific training data.
  • Figure 3: Distribution of BBB penetration probability scores across 544 proteins associated with extracellular vesicle transport and transport processes. The blue dashed line represents the baseline LNP formulation score (0.74), while the red dashed line indicates the probability score achieved with Transferrin-modified LNP (0.96).
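
The graph construction and triplet generation described in the Figure 1 caption can be illustrated with a minimal Python sketch. It is not the authors' implementation. The toy reactions, the "P:"/"M:" identifier prefixes, every function name, the weight-proportional choice of positives, the use of non-neighbors as negatives, and the margin-based Euclidean loss are all assumptions made for illustration. In the actual framework the vectors passed to the loss would be the outputs of the P2U and M2U projections, which are omitted here.

```python
import math
import random
from collections import Counter
from itertools import combinations

# Hypothetical toy dataset: each reaction is the set of entities taking part.
# The "P:" prefix marks a protein and "M:" a molecule (an illustrative convention).
REACTIONS = [
    {"P:kinase", "M:ATP", "M:ADP"},
    {"P:kinase", "M:ATP", "P:substrate"},
    {"P:transporter", "M:glucose"},
    {"P:kinase", "P:substrate", "M:ADP"},
]


def build_reaction_graph(reactions):
    """Edge weight = number of reactions in which two entities co-occur."""
    weights = Counter()
    for entities in reactions:
        for a, b in combinations(sorted(entities), 2):
            weights[(a, b)] += 1
    return weights


def domain(entity):
    """Return "P" for proteins and "M" for molecules."""
    return entity.split(":", 1)[0]


def sample_triplets(weights, rng):
    """Per anchor, draw one intra-domain and one cross-domain negative.

    Assumptions: positives are drawn in proportion to edge weight, and
    negatives are nodes that never co-occur with the anchor.
    """
    neighbors = {}
    for (a, b), w in weights.items():
        neighbors.setdefault(a, {})[b] = w
        neighbors.setdefault(b, {})[a] = w
    nodes = sorted(neighbors)
    triplets = []
    for anchor in nodes:
        partners = sorted(neighbors[anchor])
        partner_weights = [neighbors[anchor][p] for p in partners]
        non_neighbors = [
            n for n in nodes if n != anchor and n not in neighbors[anchor]
        ]
        for same_domain in (True, False):
            pool = [
                n for n in non_neighbors
                if (domain(n) == domain(anchor)) == same_domain
            ]
            if not pool:
                continue  # no valid negative of this kind for this anchor
            positive = rng.choices(partners, weights=partner_weights, k=1)[0]
            triplets.append((anchor, positive, rng.choice(pool)))
    return triplets


def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Standard margin triplet loss on Euclidean distance (assumed form)."""
    d_pos = math.dist(anchor, positive)
    d_neg = math.dist(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


if __name__ == "__main__":
    graph = build_reaction_graph(REACTIONS)
    for edge, weight in sorted(graph.items()):
        print(edge, weight)
    for triplet in sample_triplets(graph, random.Random(0)):
        print(triplet)
```

Pairing each anchor with one negative from its own domain and one from the other domain mirrors the dual negative sampling shown in the right panel of Figure 1. How the two kinds of negatives are balanced in the real loss is not specified in this summary, so the sketch simply emits one of each.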