Learning the PTM Code through a Coarse-to-Fine, Mechanism-Aware Framework

Jingjie Zhang, Hanqun Cao, Zijun Gao, Yu Wang, Shaoning Li, Jun Xu, Cheng Tan, Jun Zhu, Chang-Yu Hsieh, Chunbin Gu, Pheng Ann Heng

TL;DR

COMPASS-PTM is a two-stage, mechanism-aware framework that unifies proteome-scale PTM site profiling with enzyme-substrate pairing. Stage 1 uses a dual-modal encoder, combining a protein language model (PLM) with a chemical language model (CLM), and a crosstalk-aware prompting mechanism to model dependencies between PTMs while addressing the dual long-tail distribution of PTM data. Stage 2, the Enzyme-Substrate Pairing System (ESPS), couples the refined substrate representations with enzyme embeddings through a dual-gated fusion to predict cognate enzymes, enabling zero-shot generalization to unseen kinases. Across multiple benchmarks, the approach achieves state-of-the-art performance, recovers canonical kinase motifs, and yields mechanistic, disease-relevant predictions that can guide experimental validation and translational research.
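
To make the idea of a dual-gated fusion concrete, the following is a minimal sketch in plain Python. The summary does not give the actual gating equations, so the form used here is an assumption: each gate is a sigmoid over the concatenated substrate and enzyme embeddings and scales one of the two embeddings elementwise. The names `dual_gated_fusion`, `interaction_probability`, `w_sub`, `w_enz`, and `w_out` are hypothetical, and the random weights stand in for trained parameters.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(matrix, vector):
    """Multiply a (rows x cols) list-of-lists matrix by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

def dual_gated_fusion(substrate, enzyme, w_sub, w_enz):
    """Fuse two embeddings of equal length d with two sigmoid gates.

    Assumed form (not taken from the paper): both gates see the
    concatenation [substrate; enzyme]; one gate scales the substrate
    embedding and the other scales the enzyme embedding.
    """
    joint = substrate + enzyme  # list concatenation, length 2d
    gate_sub = [sigmoid(z) for z in matvec(w_sub, joint)]
    gate_enz = [sigmoid(z) for z in matvec(w_enz, joint)]
    return [gs * s + ge * e
            for gs, s, ge, e in zip(gate_sub, substrate, gate_enz, enzyme)]

def interaction_probability(fused, w_out, bias=0.0):
    """Map the fused vector to a probability with a logistic output unit."""
    return sigmoid(sum(w * f for w, f in zip(w_out, fused)) + bias)

# Toy embeddings and random weights (placeholders for learned values).
random.seed(0)
d = 4
substrate = [random.gauss(0, 1) for _ in range(d)]
enzyme = [random.gauss(0, 1) for _ in range(d)]
w_sub = [[random.gauss(0, 0.1) for _ in range(2 * d)] for _ in range(d)]
w_enz = [[random.gauss(0, 0.1) for _ in range(2 * d)] for _ in range(d)]
w_out = [random.gauss(0, 0.1) for _ in range(d)]

fused = dual_gated_fusion(substrate, enzyme, w_sub, w_enz)
prob = interaction_probability(fused, w_out)
print(f"interaction probability: {prob:.3f}")
```

With all gate weights set to zero, both gates equal 0.5 and the fusion reduces to the plain average of the two embeddings, which is a convenient sanity check for this assumed form.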

Abstract

Post-translational modifications (PTMs) form a combinatorial "code" that regulates protein function, yet deciphering this code, that is, linking modified sites to their catalytic enzymes, remains a central unsolved problem in understanding cellular signaling and disease. We introduce COMPASS-PTM, a mechanism-aware, coarse-to-fine learning framework that unifies residue-level PTM profiling with enzyme-substrate assignment. COMPASS-PTM integrates evolutionary representations from protein language models with physicochemical priors and a crosstalk-aware prompting mechanism that explicitly models inter-PTM dependencies. This design allows the model to learn biologically coherent patterns of cooperative and antagonistic modifications while addressing the dual long-tail distribution of PTM data. Across multiple proteome-scale benchmarks, COMPASS-PTM establishes new state-of-the-art performance, including a 122% relative F1 improvement in multi-label site prediction and a 54% gain in zero-shot enzyme assignment. Beyond accuracy, the model demonstrates interpretable generalization, recovering canonical kinase motifs and predicting disease-associated PTM rewiring caused by missense variants. By bridging statistical learning with biochemical mechanism, COMPASS-PTM unifies site-level and enzyme-level prediction into a single framework that learns the grammar underlying protein regulation and signaling.
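
The Figure 1 caption states that the crosstalk-aware prompting module starts from a PTM relationship matrix initialized from co-occurrence statistics. A minimal sketch of how such an initialization could be computed is shown below. The annotations are invented toy data rather than PTMCode2 records, and the row normalization (an estimate of how often type j appears given that type i appears) is an assumption, since the summary does not say how the statistics are normalized. In a trained model this matrix would then be treated as a learnable parameter.

```python
from itertools import combinations

PTM_TYPES = ["phosphorylation", "acetylation", "ubiquitination"]

# Toy annotations (invented for illustration, NOT PTMCode2 data):
# each set lists the PTM types observed together on one protein.
toy_annotations = [
    {"phosphorylation", "acetylation"},
    {"phosphorylation", "ubiquitination"},
    {"phosphorylation", "acetylation", "ubiquitination"},
    {"acetylation"},
]

def cooccurrence_matrix(annotations, ptm_types):
    """Count how often each pair of PTM types is observed together.

    Diagonal entries hold the number of annotations containing the type.
    """
    index = {name: i for i, name in enumerate(ptm_types)}
    n = len(ptm_types)
    counts = [[0] * n for _ in range(n)]
    for observed in annotations:
        for name in observed:
            counts[index[name]][index[name]] += 1
        for a, b in combinations(sorted(observed), 2):
            i, j = index[a], index[b]
            counts[i][j] += 1
            counts[j][i] += 1
    return counts

def row_normalize(counts):
    """Row i, column j estimates P(type j present | type i present)."""
    matrix = []
    for i, row in enumerate(counts):
        total = row[i]  # occurrences of type i
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix

counts = cooccurrence_matrix(toy_annotations, PTM_TYPES)
relationship_init = row_normalize(counts)

for name, row in zip(PTM_TYPES, relationship_init):
    print(name, [round(value, 2) for value in row])
```

The normalized matrix is asymmetric by construction: in the toy data every ubiquitinated protein is also phosphorylated, but not the reverse, so the two directions receive different initial weights.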

Paper Structure

This paper contains 57 sections, 24 equations, 12 figures, 1 table.

Figures (12)

  • Figure 1: The COMPASS-PTM framework for mechanism-aware PTM prediction. a, Conceptual overview of the two-stage, coarse-to-fine framework. Stage 1, the Multi-label Site Profiling Network (MSPN), performs proteome-scale prediction to identify potential PTM sites and their likely modification types. Stage 2, the Enzyme-Substrate Pairing System (ESPS), then takes high-confidence sites from Stage 1 and predicts their cognate enzymes, providing mechanistic context. b, Architecture of the Stage 1 MSPN. A dual-modal encoder fuses representations from a Protein Language Model (PLM), efficiently adapted via LoRA [hu2022lora], and a Chemical Language Model (CLM). A key innovation, the crosstalk-aware prompting module, uses a PTM relationship matrix (initialized from co-occurrence statistics in the PTMCode2 database [minguez2015ptmcode]) as a learnable inductive bias to guide the prediction of complex, interdependent PTM patterns. c, Architecture of the Stage 2 ESPS. For a given substrate and enzyme pair, the system integrates the PTM-aware substrate embedding generated by the MSPN with a new enzyme embedding to predict their interaction probability, thus linking a modification site to its specific regulator.
  • Figure 2: MSPN demonstrates state-of-the-art performance and robust generalization across benchmark datasets. a, Performance comparison with four baselines [yan2023mind, tan2024metoken, wang2020musitedeep, zhang2025sagephos] on the dbPTM-ML multi-label benchmark, derived from the dbPTM database [chung2025dbptm]. The radar chart displays macro-averaged scores across four key metrics, while the bar plots provide detailed results for each individual metric, with error bars showing standard deviation. b, Performance comparison on the second multi-label benchmark, qPTM-ML, derived from the qPTM database [yu2023qptm]. c, Zero-shot generalization performance on the PTMint-MC multi-class benchmark, derived from the PTMint database [hong2023ptmint], where the model trained on dbPTM-ML was evaluated directly without fine-tuning. d, Cross-task generalization performance on five binary classification datasets from the PTMGPT2 study [shrestha2024post]. For panels a, b, and d, results were averaged across three independent runs with different random seeds. Panel c represents a single, direct inference experiment.
  • Figure 3: ESPS demonstrates robust and high-fidelity performance in enzyme-substrate prediction across diverse benchmarks. a-c, Performance comparison against competing methods [pourmirzaei2025predicting, ma2023kinasephos, zhang2025sagephos, zhou2024using] on the OmniPath benchmark [turei2016omnipath] across three distinct data splitting strategies: (a) warm-start, (b) substrate cold-start, and (c) enzyme cold-start. The radar chart summarizes performance across four key metrics, while the bar plots show detailed scores for each. d, Independent validation on the SAGEPhos benchmark [zhang2025sagephos], which uses a distinct suite of metrics including 1-False Positive Rate (1-FPR). All results are the mean of three independent runs with different random seeds; error bars represent standard deviation.
  • Figure 4: Visualizing the learned representations of the COMPASS-PTM framework. a,b, UMAP projection [mcinnes2018umap] of the Stage 1 embedding space for single-PTM (a) and multi-PTM (b) sites, shown before (left) and after (right) training. c,d, Principal Component Analysis (PCA) [mackiewicz1993principal] of Stage 2 embeddings, organized by kinase family, for the OmniPath (c) and SAGEPhos (d) datasets. For each dataset, clustering based on substrate-only features (left) is coarse, but becomes highly resolved after fusion with enzyme-specific information (right), demonstrating a coarse-to-fine learning dynamic. e, Canonical kinase recognition motifs recovered from the model's top-100 high-confidence predictions for four major kinase families. The model independently rediscovers the well-established biochemical signatures for each family, including the basophilic motifs of AGC and CAMK, the proline-directed motif of CMGC, and the tyrosine-specific motif of TK, validating that its predictions are grounded in the principles of molecular recognition.
  • Figure 5: COMPASS-PTM generates mechanistic hypotheses by predicting the PTM consequences of pathogenic variants. a, The Liddle syndrome-associated p.P616L substitution in SCNN1B is predicted to cause a loss of phosphorylation at the adjacent T615 site (predicted probability drops from 0.70 to 0.27) [gwozdzinska2017hypercapnia]. b, The amyotrophic lateral sclerosis (ALS)-associated p.R524S variant in FUS is predicted to induce a gain of phosphorylation at the nearby Y526 site (probability increases from 0.20 to 0.57) [zhang2012structural, darovic2015phosphorylation]. c, Two Parkinson's disease-associated variants in LRRK2, p.R1441C and p.R1628P, are both predicted to result in a loss of phosphorylation at their respective proximal sites [muda2014parkinson]. For each case, the diagrams on the left illustrate the local sequence and predicted PTM probabilities, where the pathogenic substitution is highlighted in red and the affected PTM site is in green. The diagrams on the right summarize the underlying genetic alteration and resulting pathology.
  • ...and 7 more figures
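
The variant analysis described in the Figure 5 caption compares the predicted PTM probability at a site before and after a missense substitution. The sketch below shows one way that comparison could be organized. The helper names (`parse_substitution`, `apply_substitution`, `ptm_shift`) are hypothetical, the sequence is a made-up toy peptide rather than SCNN1B or FUS, and `toy_predict` is a placeholder rule standing in for a trained site predictor, not COMPASS-PTM itself.

```python
import re

def parse_substitution(variant):
    """Parse a one-letter protein substitution such as 'p.P616L'."""
    match = re.fullmatch(r"p\.([A-Z])(\d+)([A-Z])", variant)
    if match is None:
        raise ValueError(f"unsupported variant format: {variant!r}")
    ref, pos, alt = match.groups()
    return ref, int(pos), alt

def apply_substitution(sequence, variant):
    """Return the mutant sequence; positions are 1-based."""
    ref, pos, alt = parse_substitution(variant)
    if pos < 1 or pos > len(sequence):
        raise ValueError(f"position {pos} is outside the sequence")
    if sequence[pos - 1] != ref:
        raise ValueError(f"expected {ref} at {pos}, found {sequence[pos - 1]}")
    return sequence[:pos - 1] + alt + sequence[pos:]

def ptm_shift(predict, sequence, variant, site):
    """Compare predicted PTM probability at `site` for wild type and mutant.

    `predict(sequence, site)` is any callable returning a probability.
    A negative shift suggests loss of the PTM, a positive shift a gain.
    """
    mutant = apply_substitution(sequence, variant)
    p_wild = predict(sequence, site)
    p_mutant = predict(mutant, site)
    return p_wild, p_mutant, p_mutant - p_wild

def toy_predict(sequence, site):
    # Placeholder rule, NOT the real model: favor S/T followed by proline.
    following = sequence[site] if site < len(sequence) else ""
    return 0.8 if sequence[site - 1] in "ST" and following == "P" else 0.2

# Toy peptide with a threonine at position 4 and a proline at position 5.
p_wild, p_mutant, shift = ptm_shift(toy_predict, "AAATPAA", "p.P5L", site=4)
print(f"wild type {p_wild:.2f}, mutant {p_mutant:.2f}, shift {shift:+.2f}")
```

Applied to the probabilities reported in the caption, the same subtraction gives 0.27 - 0.70 = -0.43 for T615 in SCNN1B (a predicted loss) and 0.57 - 0.20 = +0.37 for Y526 in FUS (a predicted gain).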