Learning to Predict Mutation Effects of Protein-Protein Interactions by Microenvironment-aware Hierarchical Prompt Learning

Lirong Wu, Yijun Tian, Haitao Lin, Yufei Huang, Siyuan Li, Nitesh V Chawla, Stan Z. Li

TL;DR

This work tackles predicting mutational effects on protein-protein interactions, framed as $\Delta\Delta G$ prediction under limited labeled data. It introduces Prompt-DDG, a microenvironment-aware hierarchical prompt learning framework comprising a three-scale prompt codebook and a masked microenvironment modeling objective to pre-train the codebook, followed by lightweight per-mutation prompt adaptation to guide $\Delta\Delta G$ prediction. Across SKEMPI v2.0 and an antibody optimization case against SARS-CoV-2, Prompt-DDG achieves state-of-the-art or near state-of-the-art accuracy with improved training efficiency, despite not requiring extra pre-training data. The approach provides interpretable, mutation-specific prompts that encode multi-scale structural context, offering practical benefits for antibody design, mutational scanning, and data augmentation in protein engineering.

Abstract

Protein-protein bindings play a key role in a variety of fundamental biological processes, and thus predicting the effects of amino acid mutations on protein-protein binding is crucial. To tackle the scarcity of annotated mutation data, pre-training with massive unlabeled data has emerged as a promising solution. However, this process faces a series of challenges: (1) complex higher-order dependencies among multiple (more than paired) structural scales have not yet been fully captured; (2) it is rarely explored how mutations alter the local conformation of the surrounding microenvironment; (3) pre-training is costly, both in data size and computational burden. In this paper, we first construct a hierarchical prompt codebook to record common microenvironmental patterns at different structural scales independently. Then, we develop a novel codebook pre-training task, namely masked microenvironment modeling, to model the joint distribution of each mutation with its residue types, angular statistics, and local conformational changes in the microenvironment. With the constructed prompt codebook, we encode the microenvironment around each mutation into multiple hierarchical prompts and combine them to flexibly provide information to wild-type and mutated protein complexes about their microenvironmental differences. Such a hierarchical prompt learning framework has demonstrated superior performance and training efficiency over state-of-the-art pre-training-based methods in mutation effect prediction and a case study of optimizing human antibodies against SARS-CoV-2.

Paper Structure

This paper contains 20 sections, 15 equations, 6 figures, 6 tables, and 1 algorithm.

Figures (6)

  • Figure 1: Comparison of Prompt-DDG with three state-of-the-art methods in effectiveness (per-structure Pearson and Spearman correlations) and training efficiency (for pre-training and $\Delta\Delta G$ prediction). Prompt-DDG substantially outperforms the other methods in both effectiveness and efficiency, especially in the time spent on pre-training.
  • Figure 2: Left: A high-level overview of microenvironment-aware hierarchical prompt learning and adaptation framework for efficient $\Delta\Delta G$ prediction (Prompt-DDG). Right: Illustration of a hierarchical pre-training task by Masked Microenvironment Modeling (MMM).
  • Figure 3: Top (pre-training-based): concatenate the pre-trained representation $\mathbf{h}_i$ of residue $v_i\in\mathcal{V}$ with the original feature $\mathbf{x}_i$. Bottom (prompt-guided): encode the microenvironment $\mathcal{G}_m$ around mutation $m\in\mathcal{M}$ into a prompt $\mathbf{p}_m$ and add it to each residue $v_i\in\mathcal{G}_m$.
  • Figure 4: A comparison of correlations between experimental $\Delta\Delta G$ and $\Delta\Delta G$ predicted by four representative methods.
  • Figure 5: Distributions of per-structure Pearson correlation scores and Spearman correlation scores for seven representative methods.
  • ...and 1 more figures
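
The contrast drawn in the Figure 3 caption can be illustrated with a small sketch. This is not the authors' implementation: the function names `concat_pretrained` and `add_prompt`, the dictionary-of-lists data layout, and the toy numbers are all hypothetical. It also assumes the prompt is added directly to the input feature $\mathbf{x}_i$, which the caption does not specify. The sketch only shows the structural difference: concatenation widens every residue's feature vector, while prompt addition keeps the width fixed and touches only the residues inside the microenvironment $\mathcal{G}_m$.

```python
from typing import Dict, List, Set

Vector = List[float]


def concat_pretrained(x: Dict[int, Vector], h: Dict[int, Vector]) -> Dict[int, Vector]:
    """Pre-training-based route: append h_i to x_i for every residue i."""
    return {i: x[i] + h[i] for i in x}


def add_prompt(x: Dict[int, Vector], prompt: Vector, microenv: Set[int]) -> Dict[int, Vector]:
    """Prompt-guided route: add p_m to x_i only for residues i in G_m."""
    out: Dict[int, Vector] = {}
    for i, feat in x.items():
        if i in microenv:
            if len(feat) != len(prompt):
                raise ValueError("prompt and feature must have the same length")
            out[i] = [a + b for a, b in zip(feat, prompt)]
        else:
            # Residues outside the microenvironment are left unchanged.
            out[i] = list(feat)
    return out


# Toy data (hypothetical): three residues with 2-d features x_i,
# 1-d pre-trained representations h_i, and one mutation prompt p_m.
x = {0: [1.0, 2.0], 1: [3.0, 4.0], 2: [5.0, 6.0]}
h = {0: [0.1], 1: [0.2], 2: [0.3]}
p_m = [0.5, -0.5]
G_m = {0, 1}  # indices of residues in the mutation's microenvironment

concat_out = concat_pretrained(x, h)
prompt_out = add_prompt(x, p_m, G_m)

print(concat_out[0])  # feature width grows from 2 to 3
print(prompt_out[0])  # width stays 2; values shifted by p_m
print(prompt_out[2])  # residue 2 is outside G_m, so it is unchanged
```

In this toy layout the downstream predictor's input size depends on the pre-trained encoder in the first route, whereas in the second route it stays fixed.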