S-MolSearch: 3D Semi-supervised Contrastive Learning for Bioactive Molecule Search

Gengmo Zhou; Zhen Wang; Feng Yu; Guolin Ke; Zhewei Wei; Zhifeng Gao

TL;DR

To the authors' knowledge, S-MolSearch is the first framework that leverages molecular 3D information and affinity information in semi-supervised contrastive learning for ligand-based virtual screening; it demonstrates superior performance on the widely used benchmarks LIT-PCBA and DUD-E.

Abstract

Virtual screening is an essential technique in the early phases of drug discovery, aimed at identifying promising drug candidates from vast molecular libraries. Recently, ligand-based virtual screening has garnered significant attention because it can screen large databases without relying on specific protein-binding site information. However, obtaining binding affinity data for complexes is highly expensive, so the available data are limited and cover a relatively small chemical space. Moreover, these datasets contain a significant amount of inconsistent noise, and it is challenging to identify an inductive bias that consistently preserves molecular activity during data augmentation. To tackle these challenges, we propose S-MolSearch, to our knowledge the first framework that leverages molecular 3D information and affinity information in semi-supervised contrastive learning for ligand-based virtual screening. Drawing on the principles of inverse optimal transport, S-MolSearch efficiently processes both labeled and unlabeled data, training molecular structural encoders while generating soft labels for the unlabeled data. This design allows S-MolSearch to adaptively utilize unlabeled data within the learning process. Empirically, S-MolSearch demonstrates superior performance on the widely used benchmarks LIT-PCBA and DUD-E, surpassing both structure-based and ligand-based virtual screening methods on AUROC, BEDROC, and EF.
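The abstract reports results in terms of EF (enrichment factor). As a rough illustration of what that metric measures, the sketch below computes EF at a given fraction of a ranked library: the hit rate among the top-ranked molecules divided by the hit rate of the whole library. The function name `enrichment_factor`, the rounding convention for the size of the top slice, and the toy scores and labels are assumptions made for this example; they are not taken from the paper, and benchmark implementations may differ in such details.

```python
import math


def enrichment_factor(scores, labels, fraction=0.01):
    """Hit rate in the top `fraction` of the ranking divided by the overall hit rate.

    scores: similarity or activity scores, higher means ranked earlier.
    labels: 1 for an active molecule, 0 for an inactive or decoy.
    """
    if len(scores) != len(labels) or not scores:
        raise ValueError("scores and labels must be non-empty and of equal length")
    n_total = len(scores)
    n_actives = sum(labels)
    if n_actives == 0:
        raise ValueError("at least one active is required")

    # Size of the top slice. Rounding up (with a small tolerance against
    # floating-point error) is an assumption; other conventions exist.
    n_top = max(1, math.ceil(fraction * n_total - 1e-9))

    # Rank molecules by descending score and count actives in the top slice.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits_top = sum(label for _, label in ranked[:n_top])

    return (hits_top / n_top) / (n_actives / n_total)


# Toy library of 10 molecules with 2 actives (made-up numbers).
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
toy_labels = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
ef_20 = enrichment_factor(toy_scores, toy_labels, fraction=0.2)
print(ef_20)
```

An EF of 1 corresponds to a ranking no better than random selection, so values well above 1 at small fractions indicate that actives are concentrated near the top of the list.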

Paper Structure (31 sections, 3 theorems, 16 equations, 4 figures, 5 tables)

Introduction
Related work
Virtual Screening
Optimal Transport and inverse optimal transport
Semi-supervised learning
Method
Overview
Pretraining Backbone of Molecular Encoder
Training Strategy of Encoder on Labeled Dataset
Training Strategy of Encoder on Full Dataset
Regularization techniques
Framework for S-MolSearch Induced by Inverse Optimal Transport
Experiments
Training Data
Benchmarks
...and 16 more sections

Key Result

Proposition 1

Given encoder $f_{\theta}$ for labeled dataset $X_{sup}$ and $g_{\psi}$ for full dataset $X_{full}$, $x_{sup}$ represents the embeddings of labeled data from $f_{\theta}$, while $x_{full}$ represents the embeddings of the full dataset from $g_{\psi}$. Semi-supervised contrastive learning is then formulated as ... where $KL(X\|Y) = \sum\limits_{ij} x_{ij}\log\frac{x_{ij}}{y_{ij}} - x_{ij} + y_{ij}$ represents the ...
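The divergence term in the proposition can be evaluated directly from its definition. The expression $\sum_{ij} x_{ij}\log\frac{x_{ij}}{y_{ij}} - x_{ij} + y_{ij}$ has the form usually called the generalized KL divergence (or I-divergence), which remains well defined for non-negative matrices that do not sum to one. The sketch below is a minimal pure-Python evaluation of that expression only; the function name `generalized_kl`, the convention $0 \log 0 = 0$, and the toy matrices are assumptions for illustration, and it does not reproduce the truncated objective from the proposition.

```python
import math


def generalized_kl(X, Y):
    """Evaluate sum_ij x_ij * log(x_ij / y_ij) - x_ij + y_ij.

    X and Y are equally shaped nested lists of non-negative numbers.
    Entries with x_ij == 0 contribute only y_ij (convention 0 * log 0 = 0).
    """
    if len(X) != len(Y):
        raise ValueError("X and Y must have the same number of rows")
    total = 0.0
    for row_x, row_y in zip(X, Y):
        if len(row_x) != len(row_y):
            raise ValueError("X and Y must have the same number of columns")
        for x, y in zip(row_x, row_y):
            if x < 0 or y < 0:
                raise ValueError("entries must be non-negative")
            if x > 0:
                if y == 0:
                    raise ValueError("y_ij must be positive wherever x_ij is positive")
                total += x * math.log(x / y)
            total += y - x
    return total


# Toy matrices (made-up values): the divergence is zero when X == Y.
P = [[0.2, 0.3], [0.1, 0.4]]
Q = [[0.25, 0.25], [0.25, 0.25]]
print(generalized_kl(P, P))
print(generalized_kl(P, Q))
```

When both matrices have the same total mass, the $-x_{ij} + y_{ij}$ terms cancel in the sum and the expression reduces to the ordinary KL divergence.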

Figures (4)

Figure 1: Overview of S-MolSearch Framework
Figure 2: t-SNE visualization of molecular representations learned by S-MolSearch versus pretrained checkpoint. Different colors represent different protein targets' active molecules.
Figure 3: Performance on DUD-E and LIT-PCBA with varying amounts of labeled data, while keeping unlabeled data fixed at 1M. The blue bars represent the results of encoder $g_{\psi}$, while the green bars represent the results of encoder $f_{\theta}$.
Figure 4: Qualitative examples of similarities for targets hdac2 and csf1r in DUD-E.

Theorems & Definitions (3)

Proposition 1
Proposition 2
Lemma 3
