A general language model for peptide function identification
Jixiu Zhai, Zikun Wang, Chupei Tang, Haitian Zhong, Ziyang Xu, Yuhuan Liu, Shengrui Xu, Jingwan Wang, Dan Huang, Tianchi Lu
TL;DR
PDeepPP introduces a unified deep-learning framework for peptide function identification that generalizes across bioactive peptides and PTM sites. It fuses embeddings from the 650M-parameter ESM-2 model with a dual CNN–Transformer backbone and uses a TIM-based loss to address data imbalance. The approach jointly captures global sequence context and local motifs, achieving state-of-the-art results on 25 of 33 benchmark tasks, including high accuracy on antimicrobial peptide and phosphorylation site identification and high specificity on glycosylation site prediction. Interpretability analyses reveal biologically meaningful sequence motifs that align with known patterns, supporting the model's relevance for peptide biology. The work provides a scalable, open platform for large-scale peptide analysis with potential to accelerate therapeutic discovery and reduce reliance on experimental screening.
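The summary above does not define the TIM-based loss. The sketch below assumes it refers to a transductive-information-maximization-style objective: a cross-entropy term plus a mutual-information term that rewards confident per-sample predictions while keeping the predicted class marginal balanced, which is one common way to counter class imbalance. The function name `tim_style_loss` and the weights `lam` and `alpha` are hypothetical and are not taken from the PDeepPP code.

```python
import math

EPS = 1e-12  # guards against log(0); an illustrative choice


def entropy(p):
    """Shannon entropy (natural log) of a discrete distribution."""
    return -sum(q * math.log(q) for q in p if q > 0)


def tim_style_loss(probs, labels, lam=1.0, alpha=1.0):
    """Assumed form: lam * CE + alpha * H(Y|X) - H(Y).

    probs  : list of per-sample class-probability lists
    labels : list of integer class labels
    lam, alpha : hypothetical weighting hyperparameters
    """
    n = len(probs)
    k = len(probs[0])
    # Supervised cross-entropy on the labeled samples.
    ce = -sum(math.log(max(p[y], EPS)) for p, y in zip(probs, labels)) / n
    # Mean conditional entropy: low when each prediction is confident.
    cond_ent = sum(entropy(p) for p in probs) / n
    # Entropy of the average prediction: high when classes are used evenly.
    marginal = [sum(p[c] for p in probs) / n for c in range(k)]
    marg_ent = entropy(marginal)
    return lam * ce + alpha * cond_ent - marg_ent


if __name__ == "__main__":
    confident_balanced = [[1.0, 0.0], [0.0, 1.0]]
    uncertain = [[0.5, 0.5], [0.5, 0.5]]
    print(tim_style_loss(confident_balanced, [0, 1]))
    print(tim_style_loss(uncertain, [0, 1]))
```

Under this assumed form, confident and class-balanced predictions receive a lower loss than uniformly uncertain ones, which is the behavior the imbalance-handling claim relies on.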
Abstract
Accurate identification of bioactive peptides (BPs) and protein post-translational modifications (PTMs) is essential for understanding protein function and advancing therapeutic discovery. However, most computational methods remain limited in their generalizability across diverse peptide functions. Here, we present PDeepPP, a unified deep learning framework that integrates pretrained protein language models with a hybrid Transformer-CNN architecture, enabling robust identification across diverse peptide classes and PTM sites. We curated comprehensive benchmark datasets and implemented strategies to address data imbalance, allowing PDeepPP to systematically extract both global and local sequence features. Through extensive analyses, including dimensionality reduction and comparison studies, PDeepPP demonstrates strong, interpretable peptide representations and achieves state-of-the-art performance in 25 of the 33 biological identification tasks. Notably, PDeepPP attains high accuracy in antimicrobial peptide (0.9726) and phosphorylation site (0.9984) identification, with 99.5% specificity in glycosylation site prediction and a substantial reduction in false negatives in antimalarial tasks. By enabling large-scale, accurate peptide analysis, PDeepPP supports biomedical research and the discovery of novel therapeutic targets for disease treatment. All code, datasets, and pretrained models are publicly available via GitHub (https://github.com/fondress/PDeepPP) and Hugging Face (https://huggingface.co/fondress/PDeppPP).
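The abstract reports accuracy, specificity, and false negatives. As a reminder of how these quantities relate for a binary identification task, the following minimal sketch computes them from a confusion matrix. The labels and predictions are toy values made up for illustration, not PDeepPP results, and the helper names are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels where 1 = positive."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            tp += 1
        elif t == 0 and p == 1:
            fp += 1
        elif t == 0 and p == 0:
            tn += 1
        else:  # t == 1 and p == 0
            fn += 1
    return tp, fp, tn, fn


def accuracy(tp, fp, tn, fn):
    """Fraction of all samples classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)


def specificity(tp, fp, tn, fn):
    """Fraction of true negatives among all actual negatives."""
    return tn / (tn + fp)


def false_negative_rate(tp, fp, tn, fn):
    """Fraction of actual positives that were missed."""
    return fn / (fn + tp)


if __name__ == "__main__":
    # Toy data: 3 positives, 5 negatives (illustrative only).
    y_true = [1, 1, 1, 0, 0, 0, 0, 0]
    y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
    counts = confusion_counts(y_true, y_pred)
    print("accuracy:", accuracy(*counts))
    print("specificity:", specificity(*counts))
    print("false negative rate:", false_negative_rate(*counts))
```

On imbalanced datasets accuracy alone can look high even when many positives are missed, which is why specificity and the false-negative count are worth reading alongside it.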
