BindCLIP: A Unified Contrastive-Generative Representation Learning Framework for Virtual Screening

Anjie Qiao; Zhen Wang; Yaliang Li; Jiahua Rao; Yuedong Yang

TL;DR

BindCLIP is a unified contrastive-generative representation learning framework for virtual screening. It achieves substantial gains on challenging out-of-distribution virtual screening and improves ligand-analogue ranking on the FEP+ benchmark.

Abstract

Virtual screening aims to efficiently identify active ligands from massive chemical libraries for a given target pocket. Recent CLIP-style models such as DrugCLIP enable scalable virtual screening by embedding pockets and ligands into a shared space. However, our analyses indicate that such representations can be insensitive to fine-grained binding interactions and may rely on shortcut correlations in training data, limiting their ability to rank ligands by true binding compatibility. To address these issues, we propose BindCLIP, a unified contrastive-generative representation learning framework for virtual screening. BindCLIP jointly trains pocket and ligand encoders using CLIP-style contrastive learning together with a pocket-conditioned diffusion objective for binding pose generation, so that pose-level supervision directly shapes the retrieval embedding space toward interaction-relevant features. To further mitigate shortcut reliance, we introduce hard-negative augmentation and a ligand-ligand anchoring regularizer that prevents representation collapse. Experiments on two public benchmarks demonstrate consistent improvements over strong baselines. BindCLIP achieves substantial gains on challenging out-of-distribution virtual screening and improves ligand-analogue ranking on the FEP+ benchmark. Together, these results indicate that integrating generative, pose-level supervision with contrastive learning yields more interaction-aware embeddings and improves generalization in realistic screening settings, bringing virtual screening closer to real-world applicability.
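The CLIP-style component described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the symmetric InfoNCE form, the `temperature` value, the weighted-sum `joint_loss` with its `weight` parameter, and all function names are assumptions made for illustration. The diffusion term is passed in as a precomputed scalar, since the abstract does not specify the pose-generation objective.

```python
import numpy as np


def _log_softmax(x, axis):
    """Numerically stable log-softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def _l2_normalize(x):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def clip_contrastive_loss(pocket_emb, ligand_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch where row i of each array is a bound pair.

    The temperature is a hypothetical default, not a value from the paper.
    """
    logits = _l2_normalize(pocket_emb) @ _l2_normalize(ligand_emb).T / temperature
    idx = np.arange(logits.shape[0])
    # Pocket-to-ligand: each pocket should pick out its own ligand (row-wise).
    loss_p2l = -_log_softmax(logits, axis=1)[idx, idx].mean()
    # Ligand-to-pocket: each ligand should pick out its own pocket (column-wise).
    loss_l2p = -_log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_p2l + loss_l2p)


def joint_loss(contrastive, diffusion, weight=1.0):
    """Assumed weighted sum of the two objectives; the actual weighting is not given."""
    return contrastive + weight * diffusion


def rank_ligands(pocket_emb, library_emb):
    """Return library indices sorted by descending cosine similarity to one pocket."""
    scores = _l2_normalize(library_emb) @ _l2_normalize(pocket_emb)
    return np.argsort(-scores)
```

At screening time only `rank_ligands` would be needed in this sketch: library embeddings can be computed once and reused across pockets, which is what makes embedding-based retrieval scalable.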

Paper Structure (24 sections, 13 equations, 8 figures, 4 tables)

Introduction
Related Work
Methodology
Problem Setup and CLIP-style Approach
Binding Pose Generation-guided Representation Learning
Hard-Negative Augmentation with Ligand-Ligand Anchoring Regularizer
Training and Inference
Experiments
Experimental Setup
(RQ1) Results and Analysis
(RQ2) Ablation Study
(RQ3) Probing Fine-grained Interaction Knowledge.
Conclusion
Appendix
Interaction-disrupting Operations.
...and 9 more sections

Figures (8)

Figure 1: Illustrative probe cases from the test set.
Figure 2: Overview of the BindCLIP framework. (a) Pocket-conditioned ligand binding pose generation objective; (b) Contrastive learning objective; (c) Hard-negative augmentation workflow.
Figure 3: Visualization of pocket, ligand, and hard-negative embeddings with/without the anchoring regularizer (t-SNE).
Figure 4: Out-of-distribution evaluation on the MF-PCBA subset, excluding assays whose proteins have more than 30% sequence identity with any protein in the training set.
Figure 5: Evaluation on the 4-target FEP subset (CDK2, TYK2, JNK1, and P38) for ligand activity ranking within a chemical series.
...and 3 more figures
