MAPE-PPI: Towards Effective and Efficient Protein-Protein Interaction Prediction via Microenvironment-Aware Protein Embedding

Lirong Wu, Yijun Tian, Yufei Huang, Siyuan Li, Haitao Lin, Nitesh V Chawla, Stan Z. Li

TL;DR

MAPE-PPI tackles large-scale PPI prediction by learning a microenvironment-aware protein embedding via a large discrete codebook learned with a VQ-VAE variant. It pretrains the codebook using Masked Codebook Modeling (MCM) to capture dependencies among microenvironments, then freezes the learned embeddings as fixed features for scalable PPI graph reasoning with a GIN backbone. The approach yields superior accuracy-efficiency trade-offs over both sequence- and structure-based baselines across multiple datasets and remains robust under domain shifts and structural perturbations. This framework enables efficient, structure-informed PPI prediction at million-scale, with potential extensions to interface and conformational modeling.

Abstract

Protein-Protein Interactions (PPIs) are fundamental in various biological processes and play a key role in life activities. The growing demand for, and cost of, experimental PPI assays call for computational methods for efficient PPI prediction. While existing methods rely heavily on protein sequence for PPI prediction, it is the protein structure that is key to determining the interactions. To take both protein modalities into account, we define the microenvironment of an amino acid residue by its sequence and structural contexts, which describe the surrounding chemical properties and geometric features. In addition, microenvironments defined in previous work are largely based on experimentally assayed physicochemical properties, for which the "vocabulary" is usually extremely small. This makes it difficult to cover the diversity and complexity of microenvironments. In this paper, we propose Microenvironment-Aware Protein Embedding for PPI prediction (MAPE-PPI), which encodes microenvironments into chemically meaningful discrete codes via a sufficiently large microenvironment "vocabulary" (i.e., codebook). Moreover, we propose a novel pre-training strategy, namely Masked Codebook Modeling (MCM), to capture the dependencies between different microenvironments by randomly masking the codebook and reconstructing the input. With the learned microenvironment codebook, we can reuse it as an off-the-shelf tool to efficiently and effectively encode proteins of different sizes and functions for large-scale PPI prediction. Extensive experiments show that MAPE-PPI can scale to PPI prediction with millions of PPIs, with better trade-offs between effectiveness and computational efficiency than state-of-the-art competitors.
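
The two ideas named in the abstract, mapping each residue's microenvironment to a discrete code and masking codes for pre-training, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `quantize` and `mask_codes`, the toy 2-D vectors, the squared-L2 nearest-code rule, and the reading of "masking the codebook" as masking per-residue code indices are all assumptions made for illustration.

```python
import random


def quantize(embeddings, codebook):
    """Assign each embedding the index of its nearest code (squared L2 distance)."""
    indices = []
    for vec in embeddings:
        dists = [sum((v - c) ** 2 for v, c in zip(vec, code)) for code in codebook]
        indices.append(dists.index(min(dists)))
    return indices


def mask_codes(indices, mask_ratio, mask_id, rng):
    """Replace a random subset of code indices with mask_id.

    Returns the masked copy and the sorted masked positions; a model would
    then be trained to reconstruct the input at those positions.
    """
    n_mask = int(round(mask_ratio * len(indices)))
    positions = sorted(rng.sample(range(len(indices)), n_mask))
    masked = list(indices)
    for p in positions:
        masked[p] = mask_id
    return masked, positions


# Toy data (hypothetical): a 3-entry codebook and 4 residue embeddings in 2-D.
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
residues = [[0.1, -0.1], [0.9, 1.2], [0.2, 1.9], [1.1, 0.8]]

codes = quantize(residues, codebook)
masked, positions = mask_codes(codes, mask_ratio=0.5, mask_id=-1, rng=random.Random(0))
print(codes)
print(masked, positions)
```

In this toy setup a protein is represented by its sequence of code indices, which is what would allow a learned codebook to be reused as a fixed encoder for proteins of different sizes.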

Paper Structure (18 sections, 10 equations, 6 figures, 5 tables, 1 algorithm)

Figures (6)

  • Figure 1: Micro-F1 vs. Training Time.
  • Figure 2: Left: Illustration of microenvironment discovery and microenvironment-aware protein embedding. Right: Illustration of pre-training the codebook by Masked Codebook Modeling (MCM).
  • Figure 3: Illustration of pre-training the encoder and codebook for efficient PPI prediction, where the flame and lock icons indicate that the module is optimizable or parameter-frozen, respectively.
  • Figure 4: (a) Generalization performance comparison of testing on unseen trainset-heterogeneous test data under different data partitions. (b, c) Robustness evaluation on protein 3D structures with different accuracy, measured by Root Mean Square Deviation (RMSD), on the SHS27k dataset.
  • Figure 5: (a) Visualization (by UMAP) of the embeddings of four microenvironment codes and corresponding residues on the SHS27k dataset. (b) Distribution of amino acids within each microenvironment code on the SHS27k. (c) Distribution of amino acids primarily encoded by each microenvironmental code, as well as the distribution of amino acids in the real-world SHS27k dataset.
  • ...and 1 more figure