Protein Multimer Structure Prediction via Prompt Learning

Ziqi Gao, Xiangguo Sun, Zijing Liu, Yu Li, Hong Cheng, Jia Li

TL;DR

PromptMSP tackles the challenge of predicting protein multimer structures across varied chain counts by transferring conditional PPI knowledge through learnable prompts. It frames MSP as a two-task pipeline: a source graph-level regression task to pre-train a GNN and a target task that reformulates conditional docking as a fixed-scale graph problem via a cross-attention prompt, enabling efficient assembly in $N-1$ steps. The authors introduce a meta-learning-based prompt initialization to improve adaptation under data scarcity and demonstrate superior RMSD and TM-Score, along with faster inference, across small- and large-scale multimers on the PDB-M dataset. The work highlights the importance of modeling C-PPI over I-PPI and shows how prompt design, grounded in the $\ell=3$ PPI rule, yields better generalization across scales, with code and data publicly released. Overall, PromptMSP advances MSP by combining pre-training on small-scale data, principled task reformulation via prompts, and fast, scalable inference suitable for protein engineering workflows.

Abstract

Understanding the 3D structures of protein multimers is crucial, as they play a vital role in regulating various cellular processes. It has been empirically confirmed that multimer structure prediction (MSP) can be handled well in a step-wise assembly fashion using provided dimer structures and predicted protein-protein interactions (PPIs). However, due to the biological gap between the formation of dimers and that of larger multimers, directly applying PPI prediction techniques often leads to poor generalization on the MSP task. To address this challenge, we aim to extend the PPI knowledge to multimers of different scales (i.e., chain numbers). Specifically, we propose PromptMSP, a pre-training and Prompt tuning framework for Multimer Structure Prediction. First, we tailor the source and target tasks for effective PPI knowledge learning and efficient inference, respectively. We design PPI-inspired prompt learning to narrow the gap between the two task formats and generalize the PPI knowledge to multimers of different scales. We provide a meta-learning strategy to learn a reliable initialization of the prompt model, enabling our prompting framework to adapt effectively to the limited data available for large-scale multimers. Empirically, we achieve significant improvements in both accuracy (RMSD and TM-Score) and efficiency compared to advanced MSP models. The code, data and checkpoints are released at https://github.com/zqgao22/PromptMSP.

Paper Structure (46 sections, 14 equations, 13 figures, 6 tables, 3 algorithms)

Figures (13)

  • Figure 1: (A). Step-wise assembly for MSP. (B). Motivation for extending I-PPI to C-PPI.
  • Figure 2: Distribution in chain numbers of multimers from the PDB database.
  • Figure 3: Assembly process with the predicted assembly graph and prepared dimers.
  • Figure 4: Analysis on multimers with varied chain numbers. We select some samples for evaluation and visualize heatmaps that show the similarity of the sample embeddings obtained from different pre-trained models. Each value on the axis suggests that the model is trained on data with the specific chain number or degree value. For example in the heatmap titled 'Source task: chain', the darkness at the [5,7] block represents the similarity between the embeddings extracted from two models that are trained under the source task with 5- and 7-chain multimers, respectively.
  • Figure 5: Overview of PromptMSP. (A) First, we pre-train the GIN encoder and the task head on the graph-level regression task. After pre-training, given an arbitrary graph, $\theta^*$ and $\phi^*$ jointly output the correctness. (B) During prompt tuning, the prompt model takes the embeddings of a pair of docked and undocked (query) chains as input and learns to produce prompt embeddings that complete the 4-node path. $\theta^*$ and $\phi^*$ then jointly predict the correctness, which is equivalent to the linking probability. We use $f_{\theta^*,\phi^*,\pi^*}$ to denote the well-trained pipeline that takes a target data instance as input and outputs the linking probability of the query chains. (C) If the target multimer has 9 chains, we sequentially perform 8 steps for inference. In each step, we use the well-trained pipeline to calculate the probabilities for all possible chain pairs and select the most probable pair to assemble.
  • ...and 8 more figures
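
The step-wise inference described in Figure 5(C) can be illustrated with a minimal Python sketch. It assumes a hypothetical `score(docked, undocked)` callable standing in for the trained pipeline $f_{\theta^*,\phi^*,\pi^*}$, integer chain indices, and an arbitrarily chosen starting chain; none of these names come from the paper or its released code. The actual docking of chain structures is omitted, and `toy_score` is a made-up scorer used only to exercise the loop.

```python
from itertools import product
from typing import Callable, List, Set, Tuple

# Hypothetical stand-in for the trained pipeline: maps a
# (docked chain, undocked chain) pair to a linking probability.
LinkScorer = Callable[[int, int], float]


def greedy_assembly(num_chains: int, score: LinkScorer,
                    start: int = 0) -> List[Tuple[int, int]]:
    """Pick N-1 (docked, undocked) pairs greedily by linking probability."""
    docked: Set[int] = {start}  # assumption: assembly starts from one chain
    undocked: Set[int] = set(range(num_chains)) - docked
    edges: List[Tuple[int, int]] = []
    while undocked:
        # Score every candidate pair of one docked and one undocked chain,
        # then keep the most probable pair.
        candidates = product(sorted(docked), sorted(undocked))
        best = max(candidates, key=lambda pair: score(*pair))
        edges.append(best)
        docked.add(best[1])
        undocked.remove(best[1])
    return edges


def toy_score(i: int, j: int) -> float:
    """Made-up scorer: chains with nearby indices are more likely to link."""
    return 1.0 / (1.0 + abs(i - j))


if __name__ == "__main__":
    # A 9-chain multimer needs 8 assembly steps.
    print(greedy_assembly(9, toy_score))
```

Each iteration docks exactly one new chain, so an $N$-chain multimer is assembled in $N-1$ steps, and every step only scores pairs of one docked and one undocked chain.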

Theorems & Definitions (3)

  • Definition 1: Assembly Correctness
  • Definition 2: Source Data $\mathcal{D}_{sou}$
  • Definition 3: Target Data $\mathcal{D}_{tar}$