Similarity-Quantized Relative Difference Learning for Improved Molecular Activity Prediction

Karina Zadorozhny; Kangway V. Chuang; Bharath Sathappan; Ewan Wallace; Vishnu Sresht; Colin A. Grambow

TL;DR

SQRL reframes molecular activity prediction as learning relative differences between nearby compounds by using similarity-aware pairings. It combines a similarity-thresholded data-matching strategy with a learnable relative representation to train models on relative differences $\Delta y_{ij}$, improving generalization in low-data regimes and capturing activity cliffs. Across 30 MoleculeACE tasks and internal targets, SQRL yields consistent improvements for deep models, particularly GNNs and pretrained transformers, while relying on informative local pairs rather than indiscriminate all-pairs training. This approach provides a practical paradigm for more robust, similarity-aware drug discovery modeling.

Abstract

Accurate prediction of molecular activities is crucial for efficient drug discovery, yet remains challenging due to limited and noisy datasets. We introduce Similarity-Quantized Relative Learning (SQRL), a learning framework that reformulates molecular activity prediction as relative difference learning between structurally similar pairs of compounds. SQRL uses precomputed molecular similarities to enhance training of graph neural networks and other architectures, and significantly improves accuracy and generalization in the low-data regimes common in drug discovery. We demonstrate its broad applicability and real-world potential through benchmarking on public datasets as well as proprietary industry data. Our findings show that similarity-aware relative differences provide an effective paradigm for molecular activity prediction.
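The pairing idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `tanimoto_distance` and `match_pairs`, the representation of binary fingerprints as sets of on-bits, the inclusive threshold `d <= alpha`, the sign convention $\Delta y_{ij} = y_i - y_j$, and the choice to emit both orderings of each pair are all assumptions made for illustration.

```python
from itertools import combinations


def tanimoto_distance(fp_a: frozenset, fp_b: frozenset) -> float:
    """Tanimoto (Jaccard) distance between two binary fingerprints.

    Fingerprints are represented as sets of on-bit indices (an assumption
    made here to keep the sketch dependency-free).
    """
    union = len(fp_a | fp_b)
    if union == 0:
        return 0.0  # two empty fingerprints are treated as identical
    return 1.0 - len(fp_a & fp_b) / union


def match_pairs(fingerprints, activities, alpha):
    """Return (i, j, delta_y_ij) for every pair within distance alpha.

    delta_y_ij = y_i - y_j is an assumed sign convention, and both
    orderings of each pair are emitted (also an assumption).
    """
    pairs = []
    for i, j in combinations(range(len(fingerprints)), 2):
        if tanimoto_distance(fingerprints[i], fingerprints[j]) <= alpha:
            pairs.append((i, j, activities[i] - activities[j]))
            pairs.append((j, i, activities[j] - activities[i]))
    return pairs


# Toy data: molecules 0 and 1 share 3 of 5 distinct bits (distance 0.4);
# molecule 2 shares no bits with either (distance 1.0).
fps = [frozenset({1, 2, 3, 4}), frozenset({1, 2, 3, 5}), frozenset({7, 8, 9})]
ys = [6.5, 8.0, 5.0]
pairs = match_pairs(fps, ys, alpha=0.5)
```

With `alpha=0.5`, only molecules 0 and 1 are paired; raising `alpha` to 1.0 would recover all-pairs training on this toy set.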

Paper Structure (27 sections, 3 equations, 6 figures, 3 tables)

Introduction and Background
Related Work
Molecular property prediction.
Activity cliff prediction.
Metric, similarity, and few-shot learning.
Relative prediction.
Similarity-Thresholded Relative Representation
Problem formulation.
Dataset matching.
Relative representation.
Experimental Results
Experimental setup
Models.
Distance metrics.
Datasets.
...and 12 more sections

Figures (6)

Figure 1: Leveraging local structural information enhances predictive performance. Top: Incorporating neighbors only up to a certain distance threshold $\alpha$ improves MAE ($\downarrow$). Bottom: Pairwise distance distributions of training data (overlaid for all 30 MoleculeACE tasks) with greater skewness and kurtosis yield the best performance and a wider range of acceptable values of $\alpha$.
Figure 2: Training data sizes for each task in MoleculeACE.
Figure 3: Molecular pairs obtained by the data matching procedure described in Section \ref{sec:methods} at different Tanimoto distance thresholds $\alpha$ for MoleculeACE task CHEMBL1862_Ki.
Figure 4: Molecular pairs of activity cliff molecules obtained by the data matching procedure described in Section \ref{sec:methods} at different Tanimoto distance thresholds $\alpha$ for MoleculeACE task CHEMBL1862_Ki.
Figure 5: Leveraging local structural information enhances predictive performance. MAE ($\downarrow$) as a function of distance threshold $\alpha$ for several additional distance metrics compared to Figure \ref{fig:dist_sweep}, as well as pairwise distance distributions for each metric. Tanimoto: Tanimoto (Jaccard) distance between binary Morgan fingerprints. Tanimoto (count FP): Tanimoto (Jaccard) distance between count-based Morgan fingerprints. Substruct: Tanimoto (Jaccard) distance between substructure count vectors using a list of 1242 predefined substructures from \cite{ehrlich2012}. MCS: Distance metric based on maximum common substructure (MCS) defined as $1 - 2 N_\text{MCS} / (N_i + N_j)$ where $N_\text{MCS}$ is the number of atoms in the MCS, $N_i$ is the number of atoms in molecule $i$, and $N_j$ is the number of atoms in molecule $j$. COATI \cite{kaufman2023coati}, MolCLR \cite{wang2022molclr}, Uni-Mol \cite{zhou2023unimol}: Euclidean distances between neural network embeddings obtained with these pre-trained models.
...and 1 more figures
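
Two of the distance definitions in the Figure 5 caption can be sketched in a few lines. This is an illustrative sketch only: the function names are hypothetical, the count-based Tanimoto distance uses the common min/max generalization of the Jaccard index (the caption does not state which generalization was used), and the MCS atom counts are taken as given inputs rather than computed from structures.

```python
from collections import Counter


def tanimoto_distance_counts(a: Counter, b: Counter) -> float:
    """Tanimoto (Jaccard) distance between two count vectors.

    Uses sum(min) / sum(max) over all features, an assumed generalization
    of the binary Jaccard index to counts.
    """
    keys = set(a) | set(b)
    numerator = sum(min(a[k], b[k]) for k in keys)
    denominator = sum(max(a[k], b[k]) for k in keys)
    if denominator == 0:
        return 0.0  # two empty count vectors are treated as identical
    return 1.0 - numerator / denominator


def mcs_distance(n_mcs: int, n_i: int, n_j: int) -> float:
    """MCS-based distance 1 - 2 * N_MCS / (N_i + N_j) from the caption.

    n_mcs is the atom count of the maximum common substructure; computing
    the MCS itself is outside the scope of this sketch.
    """
    return 1.0 - 2.0 * n_mcs / (n_i + n_j)


# Toy inputs (made up for illustration).
counts_a = Counter({"x": 2, "y": 1})
counts_b = Counter({"x": 1, "z": 1})
d_counts = tanimoto_distance_counts(counts_a, counts_b)

d_mcs = mcs_distance(n_mcs=18, n_i=20, n_j=22)
d_identical = mcs_distance(n_mcs=10, n_i=10, n_j=10)
```

The MCS distance is 0 when both molecules coincide with their common substructure and approaches 1 as the shared substructure shrinks relative to the molecule sizes.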
