MeToken: Uniform Micro-environment Token Boosts Post-Translational Modification Prediction

Cheng Tan, Zhenxiao Cao, Zhangyang Gao, Lirong Wu, Siyuan Li, Yufei Huang, Jun Xia, Bozhen Hu, Stan Z. Li

TL;DR

MeToken tokenizes the micro-environment of each amino acid, integrating sequence and structural information into unified discrete tokens that give a holistic view of the factors influencing PTM sites.

Abstract

Post-translational modifications (PTMs) profoundly expand the complexity and functionality of the proteome, regulating protein attributes and interactions that are crucial for biological processes. Accurately predicting PTM sites and their specific types is therefore essential for elucidating protein function and understanding disease mechanisms. Existing computational approaches predominantly focus on protein sequences to predict PTM sites, driven by the recognition of sequence-dependent motifs. However, these approaches often overlook protein structural contexts. In this work, we first compile a large-scale sequence-structure PTM dataset, which serves as the foundation for fair comparison. We introduce the MeToken model, which tokenizes the micro-environment of each amino acid, integrating both sequence and structural information into unified discrete tokens. This model not only captures the typical sequence motifs associated with PTMs but also leverages the spatial arrangements dictated by protein tertiary structures, thus providing a holistic view of the factors influencing PTM sites. Designed to address the long-tail distribution of PTM types, MeToken employs uniform sub-codebooks that ensure even the rarest PTMs are adequately represented and distinguished. We validate the effectiveness and generalizability of MeToken across multiple datasets, demonstrating its superior performance in accurately identifying PTM types. The results underscore the importance of incorporating structural data and highlight MeToken's potential in facilitating accurate and comprehensive PTM predictions, which could significantly impact proteomics research. The code and datasets are available at https://github.com/A4Bio/MeToken.

Paper Structure

This paper contains 53 sections, 16 equations, 20 figures, 10 tables.

Figures (20)

  • Figure 1: The comparison of sequence-based and MeToken schemes. While sequence-based methods focus on sequence motifs around modification sites, MeToken first encodes the micro-environment at both sequence and structure levels and predicts the PTM types with token embeddings.
  • Figure 2: Illustration of the micro-environment of residue $i$. This figure depicts the local neighborhood of residue $i$, highlighted in red, which is interconnected through various types of edges: sequential edges (blue), $R$-radius edges (pink), and $K$-nearest edges (green). The entire micro-environment, including residue $i$ and its interconnected neighbors, is then tokenized into a unified representation token for PTM prediction.
  • Figure 3: The long-tail distributed PTM types are projected into a uniformly distributed token space. Red arrows depict the consolidation of token embeddings within each sub-codebook, enhancing intra-class similarity. In contrast, gray arrows represent the dispersion of token embeddings across different sub-codebooks, promoting inter-class distinctiveness.
  • Figure 4: Vanilla VQ looks up the codebook and hard-assigns the nearest code, while temperature-scaled VQ employs a softer, probabilistic assignment approach where codebook vectors are assigned based on weights. These weights are modulated by $\tau_v$, which adjusts the sharpness throughout the training process, transitioning from exploratory to more deterministic assignments as $\tau_v$ decreases.
  • Figure 5: The PTM prediction process with the learned codebook.
  • ...and 15 more figures
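
The micro-environment in Figure 2 can be sketched as three neighbor sets around residue $i$. This is a minimal illustration, not the authors' implementation: it assumes one 3D coordinate per residue (for example the Cα atom), and the function name `micro_environment_edges`, the parameter `seq_window`, and the default values of `radius` and `k` are hypothetical choices made for the example.

```python
import numpy as np

def micro_environment_edges(coords, i, seq_window=1, radius=10.0, k=3):
    """Return (sequential, radius, k-nearest) neighbor sets of residue i.

    coords: (n, 3) array with one point per residue (assumed: C-alpha atoms).
    seq_window, radius, k: illustrative values, not taken from the paper.
    """
    coords = np.asarray(coords, dtype=float)
    n = len(coords)

    # Sequential edges: residues within seq_window positions along the chain.
    lo, hi = max(0, i - seq_window), min(n, i + seq_window + 1)
    sequential = {j for j in range(lo, hi) if j != i}

    # Euclidean distance from residue i to every residue; exclude i itself.
    dist = np.linalg.norm(coords - coords[i], axis=1)
    dist[i] = np.inf

    # R-radius edges: every residue within `radius` of residue i.
    radius_nb = {int(j) for j in np.flatnonzero(dist <= radius)}

    # K-nearest edges: the k closest residues (at most n - 1 exist).
    knn = {int(j) for j in np.argsort(dist)[: min(k, n - 1)]}

    return sequential, radius_nb, knn
```

The union of the three sets, together with residue $i$ itself, is the local subgraph that would then be encoded and tokenized. A residue that is far away along the chain but close in space appears only in the radius or K-nearest set, which is the structural context that sequence-only methods miss.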
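
The contrast drawn in Figure 4 between hard and temperature-scaled assignment can be illustrated with a small sketch. The exact weighting used by MeToken is not given in this section, so the code below assumes a common form: a softmax over negative squared distances divided by $\tau_v$. The function names `hard_assign` and `soft_assign` are hypothetical.

```python
import numpy as np

def hard_assign(z, codebook):
    """Vanilla VQ: return the index of the nearest code vector."""
    d = np.sum((codebook - z) ** 2, axis=1)
    return int(np.argmin(d))

def soft_assign(z, codebook, tau):
    """Temperature-scaled assignment (assumed form, not the paper's formula).

    Weights are softmax(-squared_distance / tau); the output is the
    weighted average of the code vectors.
    """
    d = np.sum((codebook - z) ** 2, axis=1)
    logits = -d / tau
    logits = logits - logits.max()  # subtract the max for numerical stability
    w = np.exp(logits)
    w = w / w.sum()
    return w, w @ codebook

codebook = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
z = np.array([0.9, 0.1])

w_hot, q_hot = soft_assign(z, codebook, tau=10.0)    # exploratory: weights are spread out
w_cold, q_cold = soft_assign(z, codebook, tau=0.01)  # near-deterministic: one code dominates
```

Under this assumed form, a large $\tau_v$ spreads the weight over several codes, while a small $\tau_v$ concentrates it on the nearest code, so the soft assignment approaches the hard lookup as $\tau_v$ decreases.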