S-MolSearch: 3D Semi-supervised Contrastive Learning for Bioactive Molecule Search

Gengmo Zhou, Zhen Wang, Feng Yu, Guolin Ke, Zhewei Wei, Zhifeng Gao

TL;DR

To the authors' knowledge, S-MolSearch is the first framework that leverages molecular 3D information and affinity information in semi-supervised contrastive learning for ligand-based virtual screening. It demonstrates superior performance on the widely used benchmarks LIT-PCBA and DUD-E.

Abstract

Virtual screening is an essential technique in the early phases of drug discovery, aimed at identifying promising drug candidates from vast molecular libraries. Recently, ligand-based virtual screening has garnered significant attention because it can screen extensive databases without relying on specific protein-binding site information. However, obtaining binding affinity data for complexes is highly expensive, so the available data are limited and cover a relatively small chemical space. Moreover, these datasets contain a significant amount of inconsistent noise. It is also challenging to identify an inductive bias that consistently preserves molecular activity during data augmentation. To tackle these challenges, we propose S-MolSearch, to our knowledge the first framework that leverages molecular 3D information and affinity information in semi-supervised contrastive learning for ligand-based virtual screening. Drawing on the principles of inverse optimal transport, S-MolSearch efficiently processes both labeled and unlabeled data, training molecular structural encoders while generating soft labels for the unlabeled data. This design allows S-MolSearch to adaptively utilize unlabeled data within the learning process. Empirically, S-MolSearch demonstrates superior performance on the widely used benchmarks LIT-PCBA and DUD-E. It surpasses both structure-based and ligand-based virtual screening methods on AUROC, BEDROC, and EF.
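As a reading aid for the EF (enrichment factor) metric named in the abstract, the following is a minimal sketch of the standard definition: the hit rate among the top-ranked fraction of a screened library divided by the hit rate of the whole library. The function name `enrichment_factor`, the toy scores, and the choice to round the size of the top slice are illustrative assumptions, not the paper's evaluation code; benchmark scripts may use a different rounding convention.

```python
from typing import Sequence


def enrichment_factor(
    scores: Sequence[float], labels: Sequence[int], fraction: float = 0.01
) -> float:
    """EF at `fraction`: hit rate in the top-ranked slice / overall hit rate.

    `labels` holds 1 for active molecules and 0 for inactives.
    Hypothetical helper for illustration only.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    n_total = len(scores)
    n_actives = sum(labels)
    if n_total == 0 or n_actives == 0:
        raise ValueError("need at least one molecule and one active")
    # Size of the top slice (assumption: round, and keep at least one molecule).
    n_top = max(1, int(round(n_total * fraction)))
    # Rank molecules by descending score (higher score = more likely active).
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits_top = sum(label for _, label in ranked[:n_top])
    return (hits_top / n_top) / (n_actives / n_total)


# Toy library: 10 molecules, 2 actives, both ranked inside the top 20%.
toy_scores = [0.9, 0.8, 0.3, 0.2, 0.1, 0.4, 0.5, 0.35, 0.15, 0.05]
toy_labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
ef_20 = enrichment_factor(toy_scores, toy_labels, fraction=0.2)
ef_100 = enrichment_factor(toy_scores, toy_labels, fraction=1.0)
```

On this toy library the top 20% contains only actives, so the enrichment over random selection is fivefold. Taking the whole library as the "top" slice always gives an EF of 1.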

Paper Structure (31 sections, 3 theorems, 16 equations, 4 figures, 5 tables)

This paper contains 31 sections, 3 theorems, 16 equations, 4 figures, 5 tables.

Key Result

Proposition 1

Given encoder $f_{\theta}$ for the labeled dataset $X_{sup}$ and encoder $g_{\psi}$ for the full dataset $X_{full}$, let $x_{sup}$ denote the embeddings of the labeled data from $f_{\theta}$ and $x_{full}$ the embeddings of the full dataset from $g_{\psi}$. Semi-supervised contrastive learning is then fo… where $KL(X\|Y) = \sum_{ij}\left(x_{ij}\log\frac{x_{ij}}{y_{ij}} - x_{ij} + y_{ij}\right)$ represents the…
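The divergence written out in Proposition 1 can be evaluated directly for two non-negative matrices. The sketch below assumes plain nested lists as input and the usual convention $0 \log 0 = 0$; the function name `generalized_kl` and the example matrices are hypothetical and only illustrate the formula as stated, not how the paper uses it in training.

```python
import math
from typing import Sequence


def generalized_kl(x: Sequence[Sequence[float]], y: Sequence[Sequence[float]]) -> float:
    """Sum over i, j of x_ij * log(x_ij / y_ij) - x_ij + y_ij."""
    total = 0.0
    for row_x, row_y in zip(x, y):
        for x_ij, y_ij in zip(row_x, row_y):
            if y_ij <= 0:
                raise ValueError("entries of y must be positive")
            # Convention (assumption): a zero entry in x contributes 0 to the log term.
            log_term = x_ij * math.log(x_ij / y_ij) if x_ij > 0 else 0.0
            total += log_term - x_ij + y_ij
    return total


# Toy matrices: each row of P and Q sums to 1.
P = [[0.5, 0.5], [0.25, 0.75]]
Q = [[0.5, 0.5], [0.5, 0.5]]
kl_same = generalized_kl(P, P)  # identical matrices
kl_diff = generalized_kl(P, Q)  # differs only in the second row
```

When both matrices have the same total mass, as here, the $-x_{ij} + y_{ij}$ terms cancel in the sum and the value matches the ordinary KL divergence. The extra terms keep the quantity non-negative when the masses differ.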

Figures (4)

  • Figure 1: Overview of S-MolSearch Framework
  • Figure 2: t-SNE visualization of molecular representations learned by S-MolSearch versus the pretrained checkpoint. Each color marks the active molecules of a different protein target.
  • Figure 3: Performance on DUD-E and LIT-PCBA with varying amounts of labeled data, with the unlabeled data fixed at 1M molecules. The blue bars show the results of encoder $g_{\psi}$, and the green bars show the results of encoder $f_{\theta}$.
  • Figure 4: Qualitative examples of similarities for targets hdac2 and csf1r in DUD-E.

Theorems & Definitions (3)

  • Proposition 1
  • Proposition 2
  • Lemma 3