GeoPep: A geometry-aware masked language model for protein-peptide binding site prediction

Dian Chen, Yunkai Chen, Tong Lin, Sijie Chen, Xiaolin Cheng

TL;DR

GeoPep predicts peptide-binding sites on proteins, a task made difficult by peptide flexibility and limited structural data. It transfers knowledge from the multimodal ESM3 foundation model and augments it with parameter-efficient Kolmogorov-Arnold Networks and distance-based geometric losses. The method leverages ESM3’s integrated sequence–structure representations and enforces spatial coherence through a geometry-aware objective, achieving state-of-the-art performance on peptide–protein benchmarks and superior geometric localization of interfaces. Structural evaluations and comparisons with existing methods indicate that GeoPep is robust to induced-fit interfaces and generalizes beyond pre-formed pockets, which suggests potential for peptide therapeutics design and integration into drug discovery pipelines. The work highlights the value of combining foundation-model transfer learning with geometry-aware regularization for specialized molecular interaction tasks, while acknowledging data limitations and suggesting dataset expansion and affinity-oriented extensions as future directions.

Abstract

Multimodal approaches that integrate protein structure and sequence have achieved remarkable success in protein-protein interface prediction. However, extending these methods to protein-peptide interactions remains challenging due to the inherent conformational flexibility of peptides and the limited availability of structural data that hinder direct training of structure-aware models. To address these limitations, we introduce GeoPep, a novel framework for peptide binding site prediction that leverages transfer learning from ESM3, a multimodal protein foundation model. GeoPep fine-tunes ESM3's rich pre-learned representations from protein-protein binding to address the limited availability of protein-peptide binding data. The fine-tuned model is further integrated with a parameter-efficient neural network architecture capable of learning complex patterns from sparse data. Furthermore, the model is trained using distance-based loss functions that exploit 3D structural information to enhance binding site prediction. Comprehensive evaluations demonstrate that GeoPep significantly outperforms existing methods in protein-peptide binding site prediction by effectively capturing sparse and heterogeneous binding patterns.

Paper Structure

This paper contains 17 sections, 7 equations, 5 figures, 1 table.

Figures (5)

  • Figure 1: (a) GeoPep architecture combining ESM3 transfer learning with Kolmogorov-Arnold Network (KAN) modules for peptide binding site prediction. Peptide and protein sequences are processed through ESM3's multimodal encoder, which integrates sequence, structural, and functional information via transformer blocks with geometric attention mechanisms. The resulting ESM3 embeddings are passed to KAN layers that employ learnable B-spline activation functions to model complex nonlinear binding patterns, outputting residue-level binding probabilities. (b) Distance-based geometric loss function enforcing spatial consistency in binding site predictions. For each non-interface residue incorrectly predicted as a binding site (false positive), $r_1$ is the 3D distance to the nearest true interface residue, and $r_2$ is the maximum of $r_1$ across all false positives in the complex. The geometric loss is computed as $(r_1/r_2) \times \lambda$, where $\lambda$ is the regularization weight. This distance-normalized penalty ensures that false positives farther from true binding sites incur larger penalties, guiding the model toward spatially coherent predictions clustered around verified interface residues. Detailed mathematical formulations are provided in Methods.
  • Figure 2: (a) ROC and precision-recall curves comparing GeoPep performance using ESM2 versus ESM3 backbones. (b) Distributions of 3D distance-based loss for ESM2 and ESM3, with calculation methodology detailed in Methods. (c) Binding site prediction visualizations for ESM2 (top) and ESM3 (bottom) on representative peptide-protein complexes (PDB IDs: 7JRP, 3L75, 1LTI, and 1XLS). Peptides are colored yellow, true-positive residues are highlighted in blue, and false positives in red.
  • Figure 3: (a) ROC curves comparing KAN versus MLP architectures (top) and distance-based versus standard loss functions (bottom) in GeoPep. For the distance-based loss comparison, a sequence-level window size of 1 is used to expand ground-truth labels, providing a relaxed evaluation criterion. (b) Distributions of 3D distance-based loss for architectural and regularization comparisons, with calculation methodology detailed in Methods. (c) Representative binding site prediction visualizations for different model configurations. The top row shows predictions from KAN (left) and MLP (right) for PDB 2VPE, while the bottom row compares distance-based loss (left) and standard loss (right) configurations for PDB 1MV9.
  • Figure 4: (a) ROC curves comparing GeoPep with baseline methods (PepNN, PeSTo, ScanNet), computed using a sequence-level window size of 1 to expand ground-truth labels for a relaxed evaluation criterion. (b) True positive volume ratio (TPVR) analysis at two thresholds (0.5 and 0.8), calculated as the ratio of true positive volume to total predicted volume; higher ratios indicate more accurate and contiguous binding surface predictions. (c) Distributions of 3D distance-based loss across methods, with calculation methodology detailed in Methods. Lower values correspond to better geometric accuracy in binding site predictions. (d) Visualization of peptide–protein interface predictions for the 1R08 complex. Ground-truth interface residues are shown in the leftmost panel. For predictions by GeoPep, PepNN, PeSTo, and ScanNet, peptides are colored yellow, true positives are highlighted in blue, and false positives in red.
  • Figure 5: (a) Case studies of peptide–protein interface prediction with 100% residue-level accuracy; peptides are shown in yellow and correctly predicted interface residues in slate. Left to right: 6U3O, 3DNO, 1A1M (interface-residue count increases from low to moderate to high). (b) Overlay of the apo (2PC0) and holo (2NXL) states of HIV protease illustrating conformational change upon peptide binding. (c) Holo state rotated by $90^\circ$ about the indicated axis for structural clarity. (d) GeoPep predictions on the same target alongside three baselines (PepNN, PeSTo, and ScanNet). (e) Per-class normalized histograms and kernel density estimates (KDEs) for Positive (blue) and Negative (orange) predictions, highlighting $\Delta$RSA distributions. (f) Distributions of per-entry $\Delta$RSA for Positive vs. Negative predictions. (g) Distribution of per-entry differences between the mean $\Delta$RSA of residues predicted as Positive and those predicted as Negative. (h) Protein-level interface recall plotted against interface ratio.
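
The distance-normalized penalty described in the Figure 1 caption, $(r_1/r_2) \times \lambda$, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name, the use of one coordinate per residue (for example a C-alpha position), the hard 0/1 prediction mask, and the default `lam=1.0` are all assumptions made for illustration. How the per-residue penalties are aggregated and combined with the classification loss is specified in the paper's Methods and is not reproduced here.

```python
import numpy as np

def false_positive_distance_penalty(coords, true_mask, pred_mask, lam=1.0):
    """Hypothetical helper: per-residue penalty (r1 / r2) * lam.

    coords    : (N, 3) array, one 3D point per residue (assumed representation)
    true_mask : (N,) booleans, True for ground-truth interface residues
    pred_mask : (N,) booleans, True for residues predicted as binding
    lam       : regularization weight (lambda in the caption; value assumed)
    Returns an (N,) array that is zero everywhere except at false positives.
    """
    coords = np.asarray(coords, dtype=float)
    true_mask = np.asarray(true_mask, dtype=bool)
    pred_mask = np.asarray(pred_mask, dtype=bool)

    penalty = np.zeros(len(coords))
    false_pos = pred_mask & ~true_mask
    if not false_pos.any() or not true_mask.any():
        return penalty  # nothing to penalize, or no reference interface

    # r1: distance from each false positive to its nearest true interface residue
    diff = coords[false_pos][:, None, :] - coords[true_mask][None, :, :]
    r1 = np.linalg.norm(diff, axis=-1).min(axis=1)

    # r2: largest r1 among all false positives in this complex
    r2 = r1.max()
    penalty[false_pos] = lam * r1 / r2 if r2 > 0 else 0.0
    return penalty

# Toy complex: four residues on a line, residue 0 is the only true interface residue.
coords = [[0, 0, 0], [1, 0, 0], [3, 0, 0], [5, 0, 0]]
true_mask = [True, False, False, False]
pred_mask = [True, True, False, True]   # residues 1 and 3 are false positives
penalty = false_positive_distance_penalty(coords, true_mask, pred_mask)
# residue 1 is 1 unit away and residue 3 is 5 units away, so r2 = 5:
# residue 1 receives 1/5 = 0.2 and residue 3 receives 5/5 = 1.0
```

Because $r_1$ is divided by the largest false-positive distance in the same complex, the penalty is bounded by $\lambda$ regardless of protein size, and the farthest false positive always receives the full weight.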