ProFSA: Self-supervised Pocket Pretraining via Protein Fragment-Surroundings Alignment

Bowen Gao; Yinjun Jia; Yuanle Mo; Yuyan Ni; Weiying Ma; Zhiming Ma; Yanyan Lan

TL;DR

ProFSA tackles the scarcity of protein–ligand complex data by constructing a large-scale pseudo-ligand–pocket dataset from protein-only structures through fragment-based pocket surroundings. It trains a pocket encoder to align with fixed small-molecule encoders via a molecular-guided contrastive objective, enabling transfer of ligand-binding knowledge to pocket representations. The approach achieves state-of-the-art performance on pocket druggability, pocket matching, and ligand binding affinity prediction, with notable zero-shot generalization. By leveraging abundant protein structural data and pretrained molecular encoders, ProFSA offers a scalable pathway to model protein–ligand interactions and could extend to predicted structures and broader interaction tasks.
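
As a rough illustration of the fragment-surroundings idea, the sketch below carves a pseudo-ligand (a short run of consecutive residues) out of a protein and collects the residues near it as the pseudo-pocket. It is a minimal sketch under stated assumptions, not the authors' pipeline: the function name `extract_pseudo_pair`, the input layout (per-atom coordinates plus a per-atom residue index), and the 6.0 Å cutoff are hypothetical choices made for illustration, and any additional processing of fragments is not modeled.

```python
import numpy as np

def extract_pseudo_pair(coords, residue_ids, start, length, cutoff=6.0):
    """Split a protein into a fragment (pseudo-ligand) and its surroundings.

    coords:      (N, 3) atom coordinates in angstroms.
    residue_ids: (N,) integer residue index of each atom.
    start, length: the fragment is residues start .. start + length - 1.
    cutoff:      hypothetical distance threshold (assumption, in angstroms).
    Returns boolean atom masks (fragment_mask, pocket_mask).
    """
    coords = np.asarray(coords, dtype=float)
    residue_ids = np.asarray(residue_ids)

    # Atoms belonging to the chosen run of consecutive residues.
    frag_mask = (residue_ids >= start) & (residue_ids < start + length)
    if not frag_mask.any():
        raise ValueError("fragment selects no atoms")

    # Distance from every atom to its nearest fragment atom.
    diff = coords[:, None, :] - coords[frag_mask][None, :, :]
    min_dist = np.linalg.norm(diff, axis=-1).min(axis=1)

    # A residue joins the pocket if any of its atoms lies within the cutoff.
    near = (min_dist <= cutoff) & ~frag_mask
    pocket_residues = np.unique(residue_ids[near])
    pocket_mask = np.isin(residue_ids, pocket_residues) & ~frag_mask
    return frag_mask, pocket_mask
```

Selecting whole residues rather than individual atoms keeps the pseudo-pocket chemically interpretable; this is a design choice of the sketch, not a claim about the original implementation.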

Abstract

Pocket representations play a vital role in various biomedical applications, such as druggability estimation, ligand affinity prediction, and de novo drug design. While existing geometric features and pretrained representations have demonstrated promising results, they usually treat pockets independently of ligands, neglecting the fundamental interactions between them. However, the limited number of pocket-ligand complex structures available in the PDB (fewer than 100 thousand non-redundant pairs) hampers large-scale pretraining for interaction modeling. To address this constraint, we propose a novel pocket pretraining approach that leverages knowledge from high-resolution atomic protein structures, assisted by highly effective pretrained small-molecule representations. By segmenting protein structures into drug-like fragments and their corresponding pockets, we obtain a reasonable simulation of ligand-receptor interactions, yielding over 5 million complexes. The pocket encoder is then trained contrastively to align with the pseudo-ligand representations provided by pretrained small-molecule encoders. Our method, named ProFSA, achieves state-of-the-art performance across various tasks, including pocket druggability prediction, pocket matching, and ligand binding affinity prediction. Notably, ProFSA surpasses other pretraining methods by a substantial margin. Moreover, our work opens up a new avenue for mitigating the scarcity of protein-ligand complex data through the use of high-quality and diverse protein structure databases.
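
The contrastive alignment described in the abstract can be illustrated with a symmetric InfoNCE-style loss over a batch of matched pocket and fragment embeddings. This is a minimal sketch under stated assumptions, not the paper's exact objective: the function name `contrastive_alignment_loss`, the temperature of 0.07, and the equal weighting of the two directions are hypothetical. The fragment embeddings stand in for the outputs of a fixed molecule encoder and are treated as constants.

```python
import numpy as np

def _log_softmax(x, axis):
    # Numerically stable log-softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def contrastive_alignment_loss(pocket_emb, frag_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss; row i of each input is a matched pair.

    pocket_emb: (B, D) outputs of the trainable pocket encoder.
    frag_emb:   (B, D) outputs of the fixed molecule encoder (constants here).
    temperature: hypothetical value chosen for illustration.
    """
    # Cosine similarity between every pocket and every fragment in the batch.
    p = pocket_emb / np.linalg.norm(pocket_emb, axis=1, keepdims=True)
    f = frag_emb / np.linalg.norm(frag_emb, axis=1, keepdims=True)
    logits = p @ f.T / temperature

    idx = np.arange(len(p))
    # Pocket-to-fragment: each pocket should pick out its own fragment.
    loss_p2f = -_log_softmax(logits, axis=1)[idx, idx].mean()
    # Fragment-to-pocket: each fragment should pick out its own pocket.
    loss_f2p = -_log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_p2f + loss_f2p)
```

In a training setup one would compute the same quantity in an automatic-differentiation framework and backpropagate only into the pocket encoder, which matches the description of the molecule encoder as fixed.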

Paper Structure (52 sections, 2 theorems, 24 equations, 10 figures, 12 tables, 1 algorithm)

Introduction
Related Work
Pocket Pretraining Data
Pocket Pretraining Methods
Molecule Pretraining Methods
Our Approach
Constructing Pseudo-Ligand-Pocket Complexes from Protein Data
Contrastive Learning in Pocket-Fragment Space
Experiments
Pocket Druggability Prediction
Experimental Configuration
Baselines
Results
Pocket Matching
Experimental Configuration
...and 37 more sections

Key Result

Theorem B.1

Assume $g_T$ is Lipschitz continuous with Lipschitz constant $l_T$ and $g_S$ is normalized. Suppose there exist constants $C_1, C_2$ such that, for every $g_T({\bm{t}}^{(m)})$ on the line segment with endpoints $g_T({\bm{t}})$ and $g_T({\bm{t}}^{(0)})$, $\mathcal{L}_1({\bm{t}}^{(m)},{\bm{s}})\leq C_1$, $\mathcal{L}_2$ … Then $\forall \epsilon>0$, when the pre-training loss is sufficiently small: $\mathcal{L}_1({\bm{t}}$ …

Figures (10)

Figure 1: a) The pipeline for isolating pocket-ligand pairs from proteins; b) Joint distributions of pocket size and ligand size for the PDBBind dataset and for our ProFSA dataset before and after stratified sampling; c) Comparison between the ProFSA dataset and the PDBBind dataset in terms of the distributions of rBSA of ligand-pocket pairs.
Figure 2: Visualization of common interaction types present in both pseudo pairs and real pairs.
Figure 3: An illustration of protein fragment-surroundings alignment framework. Pockets are encoded by our pocket encoder, which is trained to align with fragment representations given by fixed pretrained molecule encoders. A simplified hydropathy-related (indicated by blue or orange color) example illustrates that fragment properties recognized by pretrained molecule encoders could guide pocket representation learning.
Figure 4: a) Visualization of two estradiol binding proteins on which ProFSA performs better. Positively charged, negatively charged, polar, and hydrophobic amino acids are shown in different colors to visualize interaction patterns. Hydrogen bonds are represented by dashed lines. b) A t-SNE visualization of pretrained representations of 7 types of ligand binding pockets collected from the BioLip database. Our ProFSA model distinguishes FMN, ATP, AMP, and GLC binding pockets better than the Uni-Mol model.
Figure 5: Comparison of different pretraining data sizes. a) Results of pocket matching; b) Results of druggability prediction.
...and 5 more figures

Theorems & Definitions (4)

Theorem B.1
Lemma B.2
Proof
Proof of Theorem B.1
