A general language model for peptide function identification

Jixiu Zhai, Zikun Wang, Chupei Tang, Haitian Zhong, Ziyang Xu, Yuhuan Liu, Shengrui Xu, Jingwan Wang, Dan Huang, Tianchi Lu

TL;DR

PDeepPP introduces a unified deep-learning framework for peptide function identification that generalizes across bioactive peptides and PTM sites by fusing 650M-parameter ESM-2 embeddings with a dual CNN–Transformer backbone and a TIM-based loss to address data imbalance. The approach jointly captures global sequence context and local motifs, achieving state-of-the-art results on 25 of 33 benchmark tasks, including notable accuracy for antimicrobial peptides, phosphorylation sites, and glycosylation specificity. Interpretability analyses reveal biologically meaningful sequence motifs that align with known patterns, supporting the model's relevance for peptide biology. The work provides a scalable, open platform for large-scale peptide analysis with potential to accelerate therapeutic discovery and reduce reliance on experimental screening.

Abstract

Accurate identification of bioactive peptides (BPs) and protein post-translational modifications (PTMs) is essential for understanding protein function and advancing therapeutic discovery. However, most computational methods remain limited in their generalizability across diverse peptide functions. Here, we present PDeepPP, a unified deep learning framework that integrates pretrained protein language models with a hybrid transformer-CNN architecture, enabling robust identification across diverse peptide classes and PTM sites. We curated comprehensive benchmark datasets and implemented strategies to address data imbalance, allowing PDeepPP to systematically extract both global and local sequence features. Through extensive analyses including dimensionality reduction and comparison studies, PDeepPP demonstrates strong, interpretable peptide representations and achieves state-of-the-art performance in 25 of the 33 biological identification tasks. Notably, PDeepPP attains high accuracy in antimicrobial (0.9726) and phosphorylation site (0.9984) identification, with 99.5% specificity in glycosylation site prediction and substantial reduction in false negatives in antimalarial tasks. By enabling large-scale, accurate peptide analysis, PDeepPP supports biomedical research and the discovery of novel therapeutic targets for disease treatment. All code, datasets, and pretrained models are publicly available via GitHub (https://github.com/fondress/PDeepPP) and Hugging Face (https://huggingface.co/fondress/PDeppPP).

Paper Structure

This paper contains 18 sections, 12 equations, 5 figures, 1 table.

Figures (5)

  • Figure 1: The PDeepPP model usage process consists of three parts: (a) Protein extraction and trimming, peptide chain annotation with task classification, and segment trimming centered on the target amino acid with positive and negative sample classification based on site type. Special loss functions are applied to handle imbalanced datasets. (b) The model framework integrates protein-specific ESM-2 embeddings with a basic tokenizer and weighted linear layers, followed by a parallel network for global and local feature fusion, and convolutional layers for binary classification.
  • Figure 2: Performance evaluation and analysis of PDeepPP. Panels (a-d) show the performance evaluation on four representative datasets: (a) Anticancer_main, (b) Antiviral, (c) N6-acetyllysine, and (d) Ubiquitination. For each dataset, the panels from left to right show the ROC curve, PR curve, confusion matrix, and UMAP dimensionality reduction visualization, respectively. Panel (e) is a line chart comparing the average performance metrics of PDeepPP and baseline models across all BP and PTM datasets. Panel (f) illustrates the sample distribution for the training and test sets of the four representative datasets. Panel (g) features incremental bar charts showing the performance delta of PDeepPP against the three baseline models across all evaluation metrics. In the two subplots for each row, each bar chart represents the distribution of numerical differences (PDeepPP minus the baseline model) for that metric across all tasks, including the magnitude and extreme values of all differences. Performance evaluation plots for the remaining tasks are available in the Supplementary Material.
  • Figure 3: Interpretability analysis of PDeepPP across four datasets. (a) N6-acetyllysine, (b) Ubiquitination, (c) Antiviral, (d) Anticancer_main. Each panel contains three visualizations from top to bottom: (1) Sequence logo of original positive samples showing amino acid conservation patterns, where letter height represents information content (bits) with highly conserved residues (high information) stacked on top; (2) Sequence logo of PDeepPP-predicted positive samples; (3) Attribution heatmap showing aggregated Integrated Gradients attributions for positive predictions, where color intensity (white to dark red) indicates the importance of each amino acid at each position for model decisions. For the two PTM tasks, Position 16 (central site) is excluded from visualization and marked with a dashed line. For the two BP tasks, due to inconsistent sequence lengths across samples, three representative lengths were selected for each task. The visualizations shown here correspond to the most frequent length. Length distributions for both datasets and interpretability visualizations for the other representative lengths are provided in the Supplementary Material.
  • Figure 4: These sections show the identification results of the PDeepPP model and its six ablated variants on four tasks: Anticancer_main, Antiviral, N6-acetyllysine, and Ubiquitination. (a-d) In each subplot, the bar chart on the left displays the counts for the four confusion matrix metrics (TP, TN, FP, FN); the table in the middle compares the numerical values of different models on various performance metrics (acc, bacc, sn, sp, mcc), where red indicates the best performance and blue indicates the second-best. The right side shows the ROC and PR curves for the models. (e) This section shows the count distribution of the optimal parameter values obtained after training the embedding and loss components on all 33 internal datasets. (f) This section shows the average acc, bacc, sn, sp, and mcc of all models across all datasets. (g) This section shows the counts and proportions of positive and negative samples in the external datasets.
  • Figure 5: Performance evaluation of prediction models across four protein datasets. (a) Detailed visualization of prediction performance in a 2x2 layout, with density plots in the upper panel and scatter plots in the lower panel for each subplot, and the task name above the subplot. For each dataset, the top row shows kernel density estimation plots for the four prediction categories (TN, TP, FN, FP), with all five models overlaid for comparison. The bottom row presents scatter plots of true labels versus predicted probabilities for each individual model, with points color-coded by prediction category. The red dashed line represents the decision threshold at 0.5. Sample sizes for each category are indicated in the legends. (b) Violin plots comparing prediction probability distributions of the PDeepPP model and different ablation models across four datasets. Each plot displays the density distribution of prediction values for TN, TP, FN and FP. (c) Statistical summary of prediction value distribution across four prediction categories (TN, TP, FN, FP) for all models and datasets combined. The table presents the count, mean, median, standard deviation, and quartile values (Q25, Q75) for each model-category combination, with all statistical values rounded to four decimal places. The backgrounds are color-coded to correspond with the prediction categories.
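
The captions for Figures 4 and 5 refer to four prediction categories (TP, TN, FP, FN), a decision threshold of 0.5, and five metrics abbreviated acc, bacc, sn, sp, and mcc. The sketch below shows how these quantities are conventionally related, assuming the abbreviations stand for accuracy, balanced accuracy, sensitivity, specificity, and Matthews correlation coefficient. It is not taken from the PDeepPP repository: the function names `confusion_counts` and `binary_metrics` are hypothetical, and counting a probability exactly equal to the threshold as positive is an assumption.

```python
import math


def confusion_counts(y_true, y_prob, threshold=0.5):
    """Count (TP, TN, FP, FN) for binary labels and predicted probabilities.

    Assumption: a sample is predicted positive when prob >= threshold.
    """
    tp = tn = fp = fn = 0
    for label, prob in zip(y_true, y_prob):
        predicted = 1 if prob >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 0 and label == 0:
            tn += 1
        elif predicted == 1 and label == 0:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn


def binary_metrics(tp, tn, fp, fn):
    """Standard binary-classification metrics from confusion-matrix counts."""

    def safe_div(numerator, denominator):
        # Return 0.0 when a metric is undefined (empty denominator).
        return numerator / denominator if denominator else 0.0

    sn = safe_div(tp, tp + fn)            # sensitivity (recall on positives)
    sp = safe_div(tn, tn + fp)            # specificity (recall on negatives)
    acc = safe_div(tp + tn, tp + tn + fp + fn)
    bacc = (sn + sp) / 2                  # balanced accuracy
    mcc = safe_div(
        tp * tn - fp * fn,
        math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    )
    return {"acc": acc, "bacc": bacc, "sn": sn, "sp": sp, "mcc": mcc}


if __name__ == "__main__":
    # Toy, made-up labels and probabilities for illustration only.
    labels = [1, 1, 1, 0, 0, 0, 0, 0]
    probs = [0.9, 0.8, 0.3, 0.1, 0.2, 0.6, 0.4, 0.05]
    counts = confusion_counts(labels, probs)
    print(counts)
    print(binary_metrics(*counts))
```

On imbalanced datasets such as those the captions describe, bacc and mcc are usually more informative than acc, because a classifier that predicts only the majority class can still reach a high acc while its bacc stays at 0.5 and its mcc at 0.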