Bi-TEAM: A Unified Cross-Scale Representation Learning Framework for Chemically Modified Biomolecules

Chunbin Gu, Zijun Gao, Mutian He, Jingjie Zhang, Haipeng Wen, Zihao Luo, Xiaorui Wang, Hanqun Cao, Jiajun Bu, Chang-Yu Hsieh, Pheng Ann Heng

TL;DR

This paper introduces Bi-TEAM (Bi-gated Residual Space Modification), a general framework that injects localized chemical variation into global protein contexts, providing a versatile foundation for machine-learning-driven exploration of peptide and protein biochemical space.

Abstract

Representation learning for protein biochemical space faces a difficult trade-off: protein language models excel at capturing long-range biological semantics but often miss fine-grained chemical details. Conversely, chemical language models encode atomic information but lack broader sequence context. To address this, we introduce Bi-TEAM (Bi-gated Residual Space Modification), a general framework that injects localized chemical variation into global protein contexts. By ensuring robustness against perturbations such as non-canonical amino acids, post-translational modifications (PTMs), and topological constraints, Bi-TEAM uncovers functional chemical dependencies often missed by evolutionary baselines. Mechanistically, Bi-TEAM maps non-canonical residues to their natural counterparts and injects atomic-level data via a bi-gated residual fusion mechanism. Crucially, this process uses modification-aware prompts to ensure that local structural changes influence global functional representations without requiring alphabet expansion. We evaluated Bi-TEAM on ten datasets spanning chemically modified peptides, PTMs, and natural proteins. The model consistently outperformed state-of-the-art baselines, achieving up to a 66 percent improvement in Matthews correlation coefficient (MCC) on scaffold-similarity splits and a 350 percent increase in hemolysis prediction accuracy. Furthermore, when deployed as an oracle for generative modeling, Bi-TEAM nearly quadrupled the success rate for designing cell-penetrating cyclic peptides. By unifying biological semantics with chemical precision, Bi-TEAM provides a versatile foundation for machine learning driven exploration of peptide and protein biochemical space.
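Because the headline result is stated as an improvement in Matthews correlation coefficient, a small worked example of the metric may help. The sketch below uses the standard MCC definition for binary classification and assumes that the reported percentages are relative improvements over a baseline, which the abstract does not state explicitly. The function names and the confusion-matrix counts are hypothetical and are not taken from the paper.

```python
import math


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from binary confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Common convention: return 0 when a row or column of the matrix is empty.
        return 0.0
    return (tp * tn - fp * fn) / denom


def relative_improvement(baseline: float, new: float) -> float:
    """Percent change of `new` relative to `baseline` (assumed reading of 'X percent improvement')."""
    return 100.0 * (new - baseline) / abs(baseline)


if __name__ == "__main__":
    # Hypothetical counts for illustration only; these are not results from the paper.
    baseline_mcc = mcc(tp=30, tn=40, fp=20, fn=10)
    model_mcc = mcc(tp=35, tn=50, fp=10, fn=5)
    print(f"baseline MCC: {baseline_mcc:.3f}")
    print(f"model MCC:    {model_mcc:.3f}")
    print(f"relative improvement: {relative_improvement(baseline_mcc, model_mcc):.1f}%")
```

MCC ranges from -1 to 1 and uses all four cells of the confusion matrix, so it is less easily inflated by class imbalance than plain accuracy. That may be one reason it is a natural headline metric for the similarity-based splits described above.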

Paper Structure (48 sections, 14 equations, 15 figures)

Figures (15)

  • Figure 1: Overview of the Bi-TEAM framework. (a) Exploration of a unified protein representation space by integrating biology-based and chemistry-based feature spaces. (b) Network architecture of Bi-TEAM for multi-property prediction, which combines multi-modal information from the Protein Language Module, Chemical Language Module, and Modified Position Module and fuses these modalities to find the optimal representation space. (c) Conditional generation of modified peptides with specific properties, achieved by using Bi-TEAM to guide BoltzDesign1. (d) Case studies of generated modified peptides with superior membrane permeability.
  • Figure 2: Comparative analysis of model performance on different modified peptide datasets for permeability prediction. (a) Schematic illustration of the principle of cell membrane permeability. (b) Representative examples of modified peptides, including cyclic peptides and lariat peptides. (c) Performance evaluation on the modified ProPAMPA dataset (li2023cycpeptmpdb; geylan2024methodology). (d) Generalization assessment using a principled cluster-based train-test partitioning strategy derived from fingerprint similarity distributions within the modified ProPAMPA dataset. (e–f) Inference results on the modified ProCacoPAMPA dataset (bhardwaj2022accurate; yu2024mucocp) and CycPeptMPDB v1.2 (li2023cycpeptmpdb), respectively, using models pre-trained on ProPAMPA. (g) Cross-entropy loss for membrane permeability prediction on the ProPAMPA, ProCacoPAMPA, and Rezai (rezai2006conformational) datasets via direct inference using ProPAMPA-trained models, alongside accuracy metrics for the Rezai dataset. Radar charts display the average values across four metrics, while bar charts present detailed individual results with discrete data points; error bars indicate standard deviation. (Results for (d), (e), and (f) are identical across the five runs as they represent direct inference.)
  • Figure 3: Generalization assessment on PTM and natural protein datasets. (a, b) Predictive performance evaluation for PTM druggability across five metrics under random splitting (a) and similarity-based splitting (b) strategies. (c, d) Comparative prediction results for natural protein hemolysis (c) and solubility (d) using random data splits. (e, f) Corresponding performance assessments for hemolysis (e) and solubility (f) under similarity-based splits. For panels (a), (c), and (d), the mean, standard deviation, and specific values for five random runs are reported. The rightmost columns in (b), (e), and (f) present t-SNE visualizations of the data distributions under similarity-based splitting, where black and red crosses designate the training and test sets, respectively. (g–i) Schematic illustrations depicting the underlying mechanisms of PTM (g), hemolysis (h), and solubility (i).
  • Figure 4: Design of a non-invasive drug delivery system. (a) Schematic diagram of the ocular absorption pathway of the AFL/cyclic peptide complex. The red dashed circle marks the binding of the cyclic peptide-aflibercept complex to VEGF. After reaching the fundus, aflibercept is released to treat neovascularization, and the cyclic peptide is degraded in serum or ocular tissue. (b) Success rate of the 1000 generated samples and the corresponding pLDDT and ipTM distributions. (c) From left to right: performance radar chart for the cell-penetrating classification benchmark; relationship between key hydrophobic amino acids, cell penetration probability, and sample count for the 1000 generated samples; and relationship between cyclic peptide length and cell penetration probability. (d) Aflibercept shown with its electrostatic surface potential colored red (-) and blue (+). (e) Structures of the four top-ranked AFL/cyclic peptide complexes.
  • Figure 5: Ablation study results. (a, b) Comparison of four metrics (a) and MCC (b) across different PLMs, CLMs, and fusion methods (from left to right). (c, e) Training loss, MCC, and ROC_AUC (from left to right) for ESM-based and Bi-TEAM models with and without the prompt. (d, f) Three sets of example predictions for ESM-based and Bi-TEAM-based models with and without the prompt.
  • ...and 10 more figures