Beyond Simple Concatenation: Fairly Assessing PLM Architectures for Multi-Chain Protein-Protein Interactions Prediction

Hazem Alsamkary, Mohamed Elshaffei, Mohamed Soudy, Sara Ossman, Abdallah Amr, Nehal Adel Abdelsalam, Mohamed Elkerdawy, Ahmed Elnaggar

TL;DR

This paper addresses sequence-based PPI binding-affinity prediction with PLMs by first building a rigorously curated PPB-Affinity dataset using a strict $\leq 30\%$ sequence-identity split to minimize leakage. It then systematically compares four PLM-adaptation architectures—EC, SC, HP, and PAD—across multiple PLMs and two training regimes (full fine-tuning and ConvBERT heads). The study finds that hierarchical pooling (HP) and pooled attention addition (PAD) consistently outperform simple concatenation methods, with Spearman correlations improving by up to about $12\%$ and peak test $\rho$ around $0.48$. This work underscores the importance of architecture design and data quality for leveraging PLMs in multi-chain PPI binding prediction and points to future directions that include hyperparameter optimization and integrating predicted structural information for multi-modal modeling.

Abstract

Protein-protein interactions (PPIs) are fundamental to numerous cellular processes, and their characterization is vital for understanding disease mechanisms and guiding drug discovery. While protein language models (PLMs) have demonstrated remarkable success in predicting protein structure and function, their application to sequence-based PPI binding affinity prediction remains relatively underexplored. This gap is often attributed to the scarcity of high-quality, rigorously refined datasets and the reliance on simple strategies for concatenating protein representations. In this work, we address these limitations. First, we introduce a meticulously curated version of the PPB-Affinity dataset comprising 8,207 unique protein-protein interaction entries, obtained by resolving annotation inconsistencies and duplicate entries for multi-chain protein interactions. The dataset is split into training, validation, and test sets under a stringent sequence identity threshold of at most 30%, which minimizes data leakage. Second, we propose and systematically evaluate four architectures for adapting PLMs to PPI binding affinity prediction: embeddings concatenation (EC), sequences concatenation (SC), hierarchical pooling (HP), and pooled attention addition (PAD). These architectures were assessed using two training methods: full fine-tuning and a lightweight approach employing ConvBERT heads over frozen PLM features. Our comprehensive experiments across multiple leading PLMs (ProtT5, ESM2, Ankh, Ankh2, and ESM3) demonstrated that the HP and PAD architectures consistently outperform conventional concatenation methods, achieving up to a 12% increase in Spearman correlation. These results highlight the necessity of sophisticated architectural designs to fully exploit the capabilities of PLMs for nuanced PPI binding affinity prediction.

Paper Structure

This paper contains 17 sections, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Architectures used to adapt protein language models to the binding-affinity prediction task. Dimensions displayed on each block denote component output dimensions; all weights were shared across parallel ligand and receptor processing pathways.
  • Figure 2: Heatmap of test set Spearman $\rho$ (each value averaged over 3 seeds): PLMs vs. setups for binding affinity prediction. Marginal means show average $\rho$ per PLM (last column) and per setup (last row). PAD: Pooled Attention Addition; HP: Hierarchical Pooling; SC: Sequences Concatenation; EC: Embeddings Concatenation.
  • Figure 3: Sequences Concatenation Architecture.
  • Figure 4: Global 1D Attention Pooler Architecture. A linear layer transforms the input sequence of hidden states (each of dimension $E_{dim}$) into a vector of scalar attention scores, one per hidden state. These scores are subsequently normalized via a softmax function to produce attention weights. The final pooled output, a single vector of dimension $E_{dim}$, is computed as the weighted sum of the original hidden states using these attention weights. Dimensions displayed on each block denote the output dimensions of that component.
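
The pooling steps described in the Figure 4 caption (linear scoring, softmax normalization, weighted sum) can be sketched in a few lines. This is a minimal, dependency-free illustration and not the authors' implementation: the function names `softmax` and `attention_pool`, the scalar `bias` argument, and the toy sizes are assumptions made for the example, and padding masks and batching are left out.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(hidden_states, weight, bias=0.0):
    """Pool (seq_len, e_dim) hidden states into a single e_dim vector.

    weight: e_dim floats standing in for the linear layer (E_dim -> 1).
    bias:   scalar bias of that layer (assumed here, default 0.0).
    """
    # Linear layer: one scalar attention score per hidden state.
    scores = [sum(w * h for w, h in zip(weight, state)) + bias
              for state in hidden_states]
    # Softmax turns the scores into attention weights that sum to 1.
    attn = softmax(scores)
    # Weighted sum of the original hidden states, one value per dimension.
    e_dim = len(weight)
    pooled = [sum(a * state[d] for a, state in zip(attn, hidden_states))
              for d in range(e_dim)]
    return pooled, attn


if __name__ == "__main__":
    random.seed(0)
    seq_len, e_dim = 5, 4  # toy sizes, chosen only for illustration
    states = [[random.gauss(0, 1) for _ in range(e_dim)]
              for _ in range(seq_len)]
    w = [random.gauss(0, 1) for _ in range(e_dim)]
    pooled, attn = attention_pool(states, w)
    print(len(pooled), round(sum(attn), 6))
```

With an all-zero `weight`, every score is equal, so the pooler reduces to plain mean pooling; a trained weight vector lets some positions contribute more than others. In a real model the weight would be a learned parameter and padded positions would normally be masked out before the softmax.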