PTransIPs: Identification of phosphorylation sites enhanced by protein PLM embeddings

Ziyang Xu; Haitian Zhong; Bingrui He; Xueying Wang; Tianchi Lu

TL;DR

PTransIPs addresses phosphorylation site identification by integrating protein pre-trained language model embeddings (ProtTrans for sequence and EMBER2 for structure) with a Transformer-CNN architecture trained with the TIM loss. On independent test sets it reaches state-of-the-art AUCs of 0.9232 for S/T sites and 0.9660 for Y sites. Ablation studies show that the sequence PLM embeddings contribute most to performance and that the TIM loss acts as a regularizer. The method also generalizes well to other peptide bioactivity tasks, which suggests it could serve as a universal encoding framework for peptide-level prediction. The authors provide public code and data. This work highlights the value of large-scale protein priors for site-specific prediction in data-limited biological settings.

Abstract

Phosphorylation is pivotal in numerous fundamental cellular processes and plays a significant role in the onset and progression of various diseases. Accurately identifying phosphorylation sites is crucial for unraveling molecular mechanisms within cells and during viral infections, and may lead to the discovery of novel therapeutic targets. In this study, we develop PTransIPs, a new deep learning framework for the identification of phosphorylation sites. Independent testing demonstrates that PTransIPs outperforms existing state-of-the-art (SOTA) methods, achieving AUCs of 0.9232 and 0.9660 for the identification of phosphorylated S/T and Y sites, respectively. PTransIPs makes three contributions. 1) It is the first to apply protein pre-trained language model (PLM) embeddings to this task: it uses ProtTrans and EMBER2 to extract sequence and structure embeddings, respectively, as additional model inputs, which mitigates the limited dataset size and the resulting overfitting and thereby improves performance. 2) It is based on the Transformer architecture, optimized by integrating convolutional neural networks and the TIM loss function, providing practical insights for model design and training. 3) Its amino acid encoding enables it to serve as a universal framework for other peptide bioactivity tasks, as shown by its strong performance in the extended experiments of this paper. Our code, data and models are publicly available at https://github.com/StatXzy7/PTransIPs.
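
The AUC values quoted in the abstract measure ranking quality: an AUC of 0.9232 means that a randomly chosen phosphorylated site receives a higher score than a randomly chosen non-phosphorylated site about 92% of the time. The sketch below illustrates that definition in plain Python. It is not the authors' evaluation code; the function name `roc_auc` and the toy scores are assumptions made for illustration, and the quadratic pairwise loop is chosen for clarity rather than speed.

```python
def roc_auc(scores, labels):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    scores : predicted scores, higher means "more likely phosphorylated"
    labels : 1 for positive samples, 0 for negative samples
    Ties between a positive and a negative count as half a correct pair.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")

    correct = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                correct += 1.0
            elif p == q:
                correct += 0.5
    return correct / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy data (hypothetical): two positives, two negatives.
    toy_scores = [0.9, 0.4, 0.6, 0.1]
    toy_labels = [1, 1, 0, 0]
    print(roc_auc(toy_scores, toy_labels))
```

In the toy data, three of the four positive-negative pairs are ordered correctly, so the function returns 0.75. For real evaluations a library routine that sorts once is preferable.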

Paper Structure (20 sections, 11 equations, 4 figures, 6 tables)

Introduction
Materials and methods
Datasets
Token and position embedding
Pre-trained embeddings for sequence and structure
The architecture of PTransIPs
Data integration
The Transformer module
The CNN module
The TIM loss function
Hyperparameter setting
Performance evaluation
Results
Evaluating the contribution of pre-trained model embedding to results
Training with the TIM loss function improves the performance of PTransIPs
...and 5 more sections

Figures (4)

Figure 1: PTransIPs architecture. The figure illustrates the steps of the PTransIPs model for identifying SARS-CoV-2 phosphorylation sites. It starts with data collection (Step 1), where the S/T and Y phosphorylation site datasets are gathered. Next, in the word embedding phase (Step 2), a unique 1024-dimensional vector representation is constructed for each amino acid type in the sequence. Data integration (Step 3) combines these embeddings to enhance the representational capacity of the input data. In the deep learning network phase (Step 4), the integrated data are processed in parallel by a CNN with residual connections and a Transformer based on multi-head attention. The outputs of the two modules are then fed into a fully connected classifier that predicts the phosphorylation sites.
Figure 2: ROC and PR curves for phosphorylation site identification for ablation study on pre-trained embedding. This figure shows the comparison of ROC and PR curves among PTransIPs, the model using only sequence pre-trained embedding, the model using only structure pre-trained embedding, and the model without any pre-trained embeddings. (A–B) show the ROC and PR curves for the S/T dataset, while (C–D) show the same curves for the Y dataset.
Figure 3: ROC and PR Curves for Phosphorylation Site Identification in the Ablation Study of the Three Terms of the TIM Loss Function. This figure compares the ROC and PR curves of the complete TIM loss function used by PTransIPs, $CE - \widehat{\mathcal{H}}\left(Y\right) +\widehat{\mathcal{H}}\left(Y \mid X\right)$, the original cross-entropy loss $CE$, and the loss functions with either the marginal entropy $\widehat{\mathcal{H}}\left(Y \right)$ or the conditional entropy $\widehat{\mathcal{H}}\left(Y \mid X \right)$ removed. (A–B) show the ROC and PR curves for the S/T dataset, while (C–D) show the same curves for the Y dataset.
Figure 4: UMAP-based 2D Feature Space Distribution of Positive and Negative Samples for S/T and Y Training Sets. The figure shows the distribution of S/T and Y sites in the feature space generated by UMAP, based on the original features from input data (A, E), features from the pre-trained sequence model (B, F), features from the pre-trained structure model (C, G), and output features from the deep learning network (CNN and Transformer modules) (D, H). Blue and red dots represent positive and negative samples, respectively.
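
The loss expression in the Figure 3 caption, $CE - \widehat{\mathcal{H}}(Y) + \widehat{\mathcal{H}}(Y \mid X)$, can be made concrete with a small sketch. The code below is an illustration under stated assumptions, not the authors' implementation. It takes class probabilities as plain Python lists, estimates the marginal entropy from the batch-mean prediction and the conditional entropy as the mean per-sample entropy (the usual empirical estimators for these terms), and applies unit weights to all three terms because the caption shows none. The names `entropy` and `tim_loss` are hypothetical.

```python
import math


def entropy(p):
    """Shannon entropy in nats of a discrete distribution; 0 * log(0) is taken as 0."""
    return -sum(q * math.log(q) for q in p if q > 0.0)


def tim_loss(probs, labels, eps=1e-12):
    """Return (ce, h_marginal, h_conditional, total) for one batch.

    probs  : list of per-sample class-probability lists, each summing to 1
    labels : list of integer class indices, one per sample
    total  : ce - h_marginal + h_conditional (unit weights, an assumption)
    """
    n = len(probs)
    num_classes = len(probs[0])

    # Cross-entropy: mean negative log-probability of the true class.
    ce = -sum(math.log(max(p[y], eps)) for p, y in zip(probs, labels)) / n

    # Marginal entropy: entropy of the prediction averaged over the batch.
    mean_pred = [sum(p[c] for p in probs) / n for c in range(num_classes)]
    h_marginal = entropy(mean_pred)

    # Conditional entropy: average entropy of each individual prediction.
    h_conditional = sum(entropy(p) for p in probs) / n

    return ce, h_marginal, h_conditional, ce - h_marginal + h_conditional


if __name__ == "__main__":
    # Toy batch (hypothetical): two samples, two classes.
    batch_probs = [[0.9, 0.1], [0.2, 0.8]]
    batch_labels = [0, 1]
    print(tim_loss(batch_probs, batch_labels))
```

Minimizing this total rewards predictions that are individually confident (low conditional entropy) while remaining balanced across classes over the batch (high marginal entropy). The difference of the two entropy terms is an empirical mutual information estimate and is never negative.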
