Protein binding affinity prediction under multiple substitutions applying eGNNs on Residue and Atomic graphs combined with Language model information: eGRAL

Arturo Fiorellini-Bernardis, Sebastien Boyer, Christoph Brunken, Bakary Diallo, Karim Beguir, Nicolas Lopez-Carranza, Oliver Bent

TL;DR

eGRAL predicts binding-affinity changes under multiple amino acid substitutions using SE(3) equivariant GNNs that integrate atomic, residue, and evolutionary features. It is pretrained on a large simulated Rosetta-based ΔΔG dataset and then fine-tuned on experimental SKEMPIcl data using LoRA, achieving improved predictive performance, especially for single mutants and some multi-mutants, and faster inference than Rosetta. Incorporating ESM2 embeddings increases model expressivity but can lead to overfitting in limited-data regimes, which underscores the need for larger or more diverse pretraining data and possibly mutated-structure information in future work. Overall, eGRAL demonstrates strong potential for rapid, multiscale prediction of mutation-driven binding changes and provides a framework for integrating language-model information with structural graphs in protein interaction modelling.

Abstract

Protein-protein interactions (PPIs) play a crucial role in numerous biological processes. Developing methods that predict binding affinity changes under substitution mutations is fundamental for modelling and re-engineering biological systems. Deep learning is increasingly recognized as a powerful tool capable of bridging the gap between in-silico predictions and in-vitro observations. With this contribution, we propose eGRAL, a novel SE(3) equivariant graph neural network (eGNN) architecture designed for predicting binding affinity changes from multiple amino acid substitutions in protein complexes. eGRAL leverages residue, atomic and evolutionary scales, thanks to features extracted from protein large language models. To address the limited availability of large-scale affinity assays with structural information, we generate a simulated dataset comprising approximately 500,000 data points. Our model is pre-trained on this dataset, then fine-tuned and tested on experimental data.

Paper Structure (22 sections, 11 figures, 6 tables)

Figures (11)

  • Figure 1: Schematic of the residue graphs. Partners AB and C refer to the individual proteins (or sub-units) that interact to form the complex. Each node describes an amino acid represented by multiple features, and edges are drawn between nodes within 9 Å. The nodes can include ESM2-generated features while the edges include information on how the partners interact.
  • Figure 2: Performance of fine-tuned eGRAL-ESM (top row) and eGRAL-noESM (bottom row). Results are broken down (a) per PDB; (b) per number of mutations for SKEMPIcl,test; and (c) per number of mutations for RBDtest. The Pearson correlation coefficient ($\rho$) is reported. Marker size is proportional to the number of points used to calculate the correlation. Colours denote the different PDB IDs or numbers of mutations. In columns (b) and (c), the numbering refers to the number of mutations. The vertical line marks the significance threshold at a p-value of 0.05.
  • Figure 3: $\Delta\Delta G$ predicted by the Rosetta-based scorer against the experimental values in SKEMPIcl for the intersection between ROSETTAsim and SKEMPIcl.
  • Figure 4: Frequency of data points in SKEMPIcl per number of mutations in each split. From left to right: training, validation and test splits.
  • Figure 5: Predicted results against ground truth by pre-trained eGRAL-noESM (top row) and eGRAL-ESM (bottom row) on ROSETTAsim,test: the predictions are reported for 4 PDB IDs. Scores and RMSE are expressed in kcal/mol. Pearson and Spearman correlation coefficients and number of data points are also reported.
  • ...and 6 more figures
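
The Figure 1 caption describes residue graphs in which nodes are amino acids and edges connect nodes within 9 Å. A minimal sketch of that construction is given below. It is illustrative only and not the authors' implementation: the function name `build_residue_edges`, the use of a single representative coordinate per residue (for example the C-alpha atom), and the binary cross-partner edge flag are all assumptions made for this example; only the 9 Å cutoff comes from the caption.

```python
import numpy as np

CUTOFF_ANGSTROM = 9.0  # edge cutoff quoted in the Figure 1 caption


def build_residue_edges(coords, partner_ids, cutoff=CUTOFF_ANGSTROM):
    """Return (edge_index, is_interface) for residue pairs within `cutoff`.

    coords:      (N, 3) array, one representative point per residue
                 (assumed here to be the C-alpha position).
    partner_ids: length-N labels saying which binding partner each
                 residue belongs to (e.g. "AB" or "C").
    """
    coords = np.asarray(coords, dtype=float)
    partner_ids = np.asarray(partner_ids)

    # Pairwise Euclidean distances between all residues, shape (N, N).
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)

    # Connect pairs within the cutoff, excluding self-loops.
    mask = (dist <= cutoff) & ~np.eye(len(coords), dtype=bool)
    src, dst = np.nonzero(mask)
    edge_index = np.stack([src, dst])  # shape (2, E); both directions kept

    # Hypothetical edge feature: 1 if the edge crosses between partners.
    is_interface = (partner_ids[src] != partner_ids[dst]).astype(int)
    return edge_index, is_interface


if __name__ == "__main__":
    # Three toy residues on a line: 5 Å and 7 Å apart, so 0-2 (12 Å) is not linked.
    toy_coords = [[0.0, 0.0, 0.0], [5.0, 0.0, 0.0], [12.0, 0.0, 0.0]]
    toy_partners = ["AB", "AB", "C"]
    edges, interface = build_residue_edges(toy_coords, toy_partners)
    print(edges)
    print(interface)
```

The dense distance matrix keeps the sketch short; for large complexes a spatial index (such as a k-d tree) would avoid the quadratic memory cost.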