PTransIPs: Identification of phosphorylation sites enhanced by protein PLM embeddings
Ziyang Xu, Haitian Zhong, Bingrui He, Xueying Wang, Tianchi Lu
TL;DR
PTransIPs addresses the challenge of phosphorylation site identification by integrating protein pre-trained language model embeddings (ProtTrans for sequences and EMBER2 for structure) with a Transformer-CNN architecture trained via TIM loss. The approach delivers state-of-the-art independent AUCs of 0.9232 for S/T and 0.9660 for Y sites, with ablation studies showing the dominant contribution of sequence PLM embeddings and the regularizing benefit of TIM loss. The method demonstrates strong generalization to broader peptide bioactivities, indicating potential as a universal encoding framework for peptide-level predictions, and the authors provide public code and data. This work highlights the value of leveraging large-scale protein priors to improve site-specific predictions in data-limited biological settings.
Abstract
Phosphorylation is pivotal in numerous fundamental cellular processes and plays a significant role in the onset and progression of various diseases. Accurate identification of phosphorylation sites is crucial for unraveling the molecular mechanisms within cells and during viral infections, potentially leading to the discovery of novel therapeutic targets. In this study, we develop PTransIPs, a new deep learning framework for the identification of phosphorylation sites. Independent testing results demonstrate that PTransIPs outperforms existing state-of-the-art (SOTA) methods, achieving AUCs of 0.9232 and 0.9660 for the identification of phosphorylated S/T and Y sites, respectively. PTransIPs makes three contributions. 1) PTransIPs is the first to apply protein pre-trained language model (PLM) embeddings to this task. It uses ProtTrans and EMBER2 to extract sequence and structure embeddings, respectively, as additional inputs to the model, which mitigates the problems of limited dataset size and overfitting and thus enhances model performance; 2) PTransIPs is based on the Transformer architecture, optimized through the integration of convolutional neural networks and the TIM loss function, providing practical insights for model design and training; 3) The encoding of amino acids in PTransIPs enables it to serve as a universal framework for other peptide bioactivity tasks, as shown by its excellent performance in the extended experiments of this paper. Our code, data and models are publicly available at https://github.com/StatXzy7/PTransIPs.
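For readers unfamiliar with the AUC metric quoted above, the following is a minimal sketch of how an area under the ROC curve can be computed from binary labels and predicted scores. It is not the authors' evaluation code: the function name `roc_auc` and the toy labels and scores are hypothetical, chosen only for illustration, and the sketch uses the rank-based definition (the probability that a randomly chosen positive site receives a higher score than a randomly chosen negative site, with ties counted as one half).

```python
def roc_auc(labels, scores):
    """Rank-based AUC for binary labels (1 = positive, 0 = negative).

    Hypothetical helper for illustration only; not taken from PTransIPs.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")

    # Count positive/negative pairs where the positive is ranked higher.
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties contribute half a win
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy data (assumed): two phosphorylated and two non-phosphorylated sites.
    toy_labels = [1, 1, 0, 0]
    toy_scores = [0.9, 0.4, 0.6, 0.1]
    print(roc_auc(toy_labels, toy_scores))
```

The pairwise loop is quadratic in the number of sites, which is fine for a small illustration; for a real test set, a sort-based implementation such as the one in a standard metrics library would be the more practical choice.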
