DEL-Ranking: Ranking-Correction Denoising Framework for Elucidating Molecular Affinities in DNA-Encoded Libraries

Hanqun Cao, Mutian He, Ning Ma, Chang-yu Hsieh, Chunbin Gu, Pheng-Ann Heng

TL;DR

DEL-Ranking addresses two core DEL challenges, Distribution Noise and Distribution Shift, by coupling a zero-inflated Poisson (ZIP) read-count model with ranking-driven denoising and an Activity-Referenced Correction (ARC) module. The method introduces Pairwise Soft Ranking and Listwise Global Ranking losses to enforce local and global read-count order, while ARC uses self-training and a consistency signal grounded in activity labels to align read counts with true binding affinities. It reports state-of-the-art correlations with $K_i$ across multiple DEL datasets and zero-shot generalization to new targets, and it can uncover high-affinity functional groups such as pyrimidine sulfonamide. The work also provides new multi-modal DEL datasets and ablation analyses that show how each component contributes to interpretability and predictive accuracy, which may accelerate rational DEL-driven drug discovery.
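To make the ZIP read-count model concrete, the following is a minimal sketch of the standard zero-inflated Poisson log-likelihood. It is not the paper's implementation: the function names `zip_log_pmf` and `zip_nll` are hypothetical, and the rate `lam` and zero-inflation probability `pi` are passed as fixed scalars here, whereas a learned model would presumably predict them per compound.

```python
import math

def zip_log_pmf(r: int, lam: float, pi: float) -> float:
    """Log-probability of read count r under a zero-inflated Poisson.

    lam: Poisson rate of the count component (lam > 0)
    pi:  probability of a structural (excess) zero (0 <= pi < 1)
    """
    if r == 0:
        # A zero can come from the inflation component or from the Poisson itself.
        return math.log(pi + (1.0 - pi) * math.exp(-lam))
    # Non-zero counts can only come from the Poisson component.
    log_poisson = r * math.log(lam) - lam - math.lgamma(r + 1)
    return math.log(1.0 - pi) + log_poisson

def zip_nll(counts, lam: float, pi: float) -> float:
    """Mean negative log-likelihood of observed counts (illustrative scalar parameters)."""
    return -sum(zip_log_pmf(r, lam, pi) for r in counts) / len(counts)
```

With `pi = 0` the model reduces to an ordinary Poisson; a larger `pi` lets the model explain the many zero read counts typical of sparse screening data without forcing the rate `lam` toward zero.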

Abstract

DNA-encoded library (DEL) screening has revolutionized the detection of protein-ligand interactions through read counts, enabling rapid exploration of vast chemical spaces. However, noise in read counts, stemming from nonspecific interactions, can mislead this exploration process. We present DEL-Ranking, a novel distribution-correction denoising framework that addresses these challenges. Our approach introduces two key innovations: (1) a novel ranking loss that rectifies relative magnitude relationships between read counts, enabling the learning of causal features determining activity levels, and (2) an iterative algorithm employing self-training and consistency loss to establish model coherence between activity label and read count predictions. Furthermore, we contribute three new DEL screening datasets, the first to comprehensively include multi-dimensional molecular representations, protein-ligand enrichment values, and their activity labels. These datasets mitigate data scarcity issues in AI-driven DEL screening research. Rigorous evaluation on diverse DEL datasets demonstrates DEL-Ranking's superior performance across multiple correlation metrics, with significant improvements in binding affinity prediction accuracy. Our model exhibits zero-shot generalization ability across different protein targets and successfully identifies potential motifs determining compound binding affinity. This work advances DEL screening analysis and provides valuable resources for future research in this area.
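The idea of a ranking loss that rectifies relative magnitudes between read counts can be illustrated with a generic pairwise logistic (RankNet-style) loss. This is a sketch under assumptions, not the paper's Pairwise Soft Ranking loss, whose exact form is not given in this section; the function name `pairwise_soft_rank_loss` and the `temperature` parameter are hypothetical.

```python
import math

def pairwise_soft_rank_loss(pred, target, temperature: float = 1.0) -> float:
    """Mean logistic loss over all ordered pairs (i, j) with target[i] > target[j].

    The loss is small when pred[i] exceeds pred[j] by a wide margin and
    large when the predicted order contradicts the observed order.
    """
    total, n_pairs = 0.0, 0
    for i in range(len(pred)):
        for j in range(len(pred)):
            if target[i] > target[j]:
                margin = (pred[i] - pred[j]) / temperature
                # softplus(-margin) = log(1 + exp(-margin)), computed stably
                total += max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
                n_pairs += 1
    # Tied targets contribute no pairs; return 0.0 if nothing is comparable.
    return total / n_pairs if n_pairs else 0.0
```

Because the loss depends only on differences between predictions, it penalizes a wrong ordering rather than a wrong absolute count, which is the property that makes ranking objectives attractive when raw read counts are noisy.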

Paper Structure

This paper contains 27 sections, 3 theorems, 35 equations, 12 figures, 7 tables, 1 algorithm.

Key Result

Lemma 1

Given a set of feature-read count pairs $\{(x_i, r_i)\}_{i=1}^n$, where $x_i$ is the fused representation of sample $i$ based on $f_i$ and $p_i$, and a well-fitted Zero-Inflated Poisson model $f_\text{ZIP}(r|x)$, the ranking loss $\mathcal{L}_\text{rank}$ provides positive information gain over the ZIP model, $H(R | f_\text{ZIP}) - H(R | f_\text{ZIP}, \mathcal{L}_\text{rank}) > 0$, where $H(R | \cdot)$ denotes the conditional entropy of read counts $R$.
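Information gain in the sense of Lemma 1 is a reduction in the conditional entropy of read counts once extra information is available. The toy calculation below shows that arithmetic on invented data; the bucket labels and counts are purely illustrative assumptions and are not taken from the paper, and `conditional_entropy` is a hypothetical helper.

```python
import math
from collections import Counter

def conditional_entropy(pairs) -> float:
    """Empirical H(R | C) in bits from a list of (condition, read_count) observations."""
    n = len(pairs)
    joint = Counter(pairs)                 # counts of each (condition, r)
    cond = Counter(c for c, _ in pairs)    # counts of each condition
    return -sum((k / n) * math.log2(k / cond[c]) for (c, _), k in joint.items())

# Invented observations: (coarse ZIP-level bucket, rank bucket, read count).
data = [
    ("low", "below", 0), ("low", "below", 0),
    ("low", "above", 1), ("low", "above", 2),
    ("high", "below", 5), ("high", "above", 9),
]

h_zip_only = conditional_entropy([(z, r) for z, _, r in data])
h_zip_and_rank = conditional_entropy([((z, k), r) for z, k, r in data])
information_gain = h_zip_only - h_zip_and_rank
```

Conditioning on more information can never increase entropy, so this difference is always non-negative; the lemma's claim is the stronger statement that for the ranking loss it is strictly positive.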

Figures (12)

  • Figure 1: Illustration of the DEL screening process. Cycling: Creating unique compounds, each tagged with a distinctive DNA sequence. Binding: These compounds are then exposed to the target protein. Wash, Elute and Amplify: Compounds that bind to the target are retained, while others are washed away. The DNA tags of the bound compounds are then amplified and analyzed using sequencing techniques. Sequence & Counting: This process results in a distribution of read counts for both the target-bound samples and control samples.
  • Figure 2: Overview of DEL-Ranking framework. The model directly fuses molecule binding poses and fingerprints as input features. ARC employs target effects and binding affinity to enhance read count prediction. The ranking-based loss incorporates target effects and matrix effects for noise removal, improving the correlation between predicted read counts and true binding affinities.
  • Figure 3: Quantitative analysis of Top-50 selection, including $K_i$ distribution and accuracy.
  • Figure 4: Quantitative analysis of Top-50 selection, including $K_i$ distribution and accuracy for DEL-Dock (Shmilovich et al., 2023).
  • Figure 5: Visualization of Top-50 high affinity cases without benzene sulfonamide.
  • ...and 7 more figures

Theorems & Definitions (5)

  • Lemma 1
  • Theorem 2
  • proof
  • proof
  • Proposition 1