Similarity-Quantized Relative Difference Learning for Improved Molecular Activity Prediction

Karina Zadorozhny, Kangway V. Chuang, Bharath Sathappan, Ewan Wallace, Vishnu Sresht, Colin A. Grambow

TL;DR

SQRL reframes molecular activity prediction as learning relative differences between nearby compounds by using similarity-aware pairings. It combines a similarity-thresholded data-matching strategy with a learnable relative representation to train models on relative differences $\Delta y_{ij}$, improving generalization in low-data regimes and capturing activity cliffs. Across 30 MoleculeACE tasks and internal targets, SQRL yields consistent improvements for deep models, particularly GNNs and pretrained transformers, while relying on informative local pairs rather than indiscriminate all-pairs training. This approach provides a practical paradigm for more robust, similarity-aware drug discovery modeling.

Abstract

Accurate prediction of molecular activities is crucial for efficient drug discovery, yet remains challenging due to limited and noisy datasets. We introduce Similarity-Quantized Relative Learning (SQRL), a learning framework that reformulates molecular activity prediction as relative difference learning between structurally similar pairs of compounds. SQRL uses precomputed molecular similarities to enhance training of graph neural networks and other architectures, and significantly improves accuracy and generalization in low-data regimes common in drug discovery. We demonstrate its broad applicability and real-world potential through benchmarking on public datasets as well as proprietary industry data. Our findings demonstrate that leveraging similarity-aware relative differences provides an effective paradigm for molecular activity prediction.
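
The pairing idea in the abstract can be illustrated with a short sketch. This is a minimal illustration under stated assumptions, not the authors' implementation: fingerprints are represented as sets of on-bit indices, the threshold $\alpha$ is applied to the Tanimoto distance, and the relative difference is taken as $\Delta y_{ij} = y_i - y_j$. The sign convention, the helper names `tanimoto_distance` and `match_pairs`, and the toy data are all hypothetical.

```python
from itertools import combinations


def tanimoto_distance(fp_a, fp_b):
    """Tanimoto (Jaccard) distance between two sets of on-bit indices."""
    union = len(fp_a | fp_b)
    if union == 0:
        return 0.0  # two empty fingerprints are treated as identical
    return 1.0 - len(fp_a & fp_b) / union


def match_pairs(fingerprints, activities, alpha):
    """Keep only pairs (i, j), i < j, whose distance is at most alpha.

    Returns a list of (i, j, delta_y) with delta_y = y_i - y_j
    (assumed sign convention).
    """
    pairs = []
    for i, j in combinations(range(len(fingerprints)), 2):
        if tanimoto_distance(fingerprints[i], fingerprints[j]) <= alpha:
            pairs.append((i, j, activities[i] - activities[j]))
    return pairs


# Toy data (hypothetical): three molecules, two of them structurally close.
fingerprints = [{1, 2, 3, 4}, {1, 2, 3, 5}, {7, 8, 9}]
activities = [6.0, 7.5, 5.0]

local_pairs = match_pairs(fingerprints, activities, alpha=0.5)
all_pairs = match_pairs(fingerprints, activities, alpha=1.0)
print(local_pairs)     # [(0, 1, -1.5)]
print(len(all_pairs))  # 3
```

With a small $\alpha$ only the structurally close pair survives, while $\alpha = 1$ recovers all-pairs matching. A model would then be trained to predict `delta_y` from each retained pair.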

Paper Structure (27 sections, 3 equations, 6 figures, 3 tables)

Figures (6)

  • Figure 1: Leveraging local structural information enhances predictive performance. Top: Incorporating neighbors only up to a certain distance threshold $\alpha$ improves MAE ($\downarrow$). Bottom: Pairwise distance distributions of the training data (overlaid for all 30 MoleculeACE tasks); distributions with greater skewness and kurtosis yield the best performance and a wider range of acceptable values of $\alpha$.
  • Figure 2: Training data sizes for each task in MoleculeACE.
  • Figure 3: Molecular pairs obtained by the data matching procedure described in Section \ref{sec:methods} at different Tanimoto distance thresholds $\alpha$ for MoleculeACE task CHEMBL1862_Ki.
  • Figure 4: Molecular pairs of activity cliff molecules obtained by the data matching procedure described in Section \ref{sec:methods} at different Tanimoto distance thresholds $\alpha$ for MoleculeACE task CHEMBL1862_Ki.
  • Figure 5: Leveraging local structural information enhances predictive performance. MAE ($\downarrow$) as a function of distance threshold $\alpha$ for several additional distance metrics compared to Figure \ref{fig:dist_sweep}, as well as pairwise distance distributions for each metric. Tanimoto: Tanimoto (Jaccard) distance between binary Morgan fingerprints. Tanimoto (count FP): Tanimoto (Jaccard) distance between count-based Morgan fingerprints. Substruct: Tanimoto (Jaccard) distance between substructure count vectors using a list of 1242 predefined substructures from \cite{ehrlich2012}. MCS: Distance metric based on maximum common substructure (MCS), defined as $1 - 2 N_\text{MCS} / (N_i + N_j)$, where $N_\text{MCS}$ is the number of atoms in the MCS, $N_i$ is the number of atoms in molecule $i$, and $N_j$ is the number of atoms in molecule $j$. COATI \cite{kaufman2023coati}, MolCLR \cite{wang2022molclr}, Uni-Mol \cite{zhou2023unimol}: Euclidean distances between neural network embeddings obtained with these pre-trained models.
  • ...and 1 more figures
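
The distance metrics named in the Figure 5 caption can be written out as a short sketch. This is illustrative only: the MCS formula follows the caption, but the count-vector Tanimoto is assumed here to be the generalized Jaccard form (sum of minima over sum of maxima), which the caption does not spell out. All function names and inputs are hypothetical, and atom counts and embeddings are passed in directly rather than computed from molecules.

```python
import math


def mcs_distance(n_mcs, n_i, n_j):
    """MCS-based distance from the caption: 1 - 2 * N_MCS / (N_i + N_j)."""
    return 1.0 - 2.0 * n_mcs / (n_i + n_j)


def count_tanimoto_distance(counts_a, counts_b):
    """Tanimoto distance between count vectors stored as {feature: count}.

    Assumes the generalized Jaccard form: 1 - sum(min) / sum(max).
    """
    keys = set(counts_a) | set(counts_b)
    num = sum(min(counts_a.get(k, 0), counts_b.get(k, 0)) for k in keys)
    den = sum(max(counts_a.get(k, 0), counts_b.get(k, 0)) for k in keys)
    return 0.0 if den == 0 else 1.0 - num / den


def embedding_distance(u, v):
    """Euclidean distance between two embedding vectors of equal length."""
    return math.dist(u, v)


# Identical molecules share every atom in the MCS, so the distance is 0.
print(mcs_distance(n_mcs=10, n_i=10, n_j=10))  # 0.0
# No common substructure gives the maximum distance of 1.
print(mcs_distance(n_mcs=0, n_i=5, n_j=7))     # 1.0
print(count_tanimoto_distance({"a": 2, "b": 1}, {"a": 1, "c": 1}))  # 0.75
print(embedding_distance([0.0, 0.0], [3.0, 4.0]))  # 5.0
```

The Tanimoto-style and MCS distances are bounded in $[0, 1]$, whereas Euclidean distances between embeddings are unbounded, so a threshold $\alpha$ is not directly comparable across these metrics.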