Protein binding affinity prediction under multiple substitutions applying eGNNs on Residue and Atomic graphs combined with Language model information: eGRAL
Arturo Fiorellini-Bernardis, Sebastien Boyer, Christoph Brunken, Bakary Diallo, Karim Beguir, Nicolas Lopez-Carranza, Oliver Bent
TL;DR
eGRAL predicts changes in binding affinity (ΔΔG) caused by multiple amino acid substitutions, using SE(3) equivariant GNNs that combine atomic, residue, and evolutionary features. The model is pretrained on a large simulated ΔΔG dataset generated with Rosetta, then fine-tuned on experimental SKEMPIcl data using LoRA. This yields improved predictive performance, especially for single mutants and some multi-mutants, and faster inference than Rosetta. Adding ESM2 embeddings makes the model more expressive but can cause overfitting when data are limited, which points to the need for larger or more diverse pretraining data and, possibly, mutated-structure information in future work. Overall, eGRAL shows strong potential for rapid, multiscale prediction of mutation-driven binding changes and provides a framework for integrating language-model information with structural graphs in protein interaction modelling.
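The LoRA fine-tuning step mentioned in the summary can be illustrated with a minimal sketch. This is not the eGRAL implementation. It only shows the general low-rank adaptation idea: the pretrained weight W stays frozen and a trainable update (alpha / r) * B A is added to it. The matrix sizes, the rank, the value of alpha, and the helper names matmul and lora_weight are all hypothetical, chosen for illustration.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_weight(w, a, b, alpha):
    """Return the effective weight W + (alpha / r) * B @ A.

    w: frozen pretrained weight, shape (d_out, d_in)
    a: trainable down-projection, shape (r, d_in)
    b: trainable up-projection, shape (d_out, r)
    """
    r = len(a)
    scale = alpha / r
    delta = matmul(b, a)
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


random.seed(0)
d_out, d_in, rank = 8, 8, 2  # hypothetical sizes, not taken from the paper

# Frozen "pretrained" weight (random here, purely for illustration).
w = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
# LoRA factors: A gets a small random init, B starts at zero,
# so the adapted layer initially equals the pretrained one.
a = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(rank)]
b = [[0.0] * rank for _ in range(d_out)]

w_start = lora_weight(w, a, b, alpha=4.0)

# Only A and B would be trained during fine-tuning.
full_params = d_out * d_in
lora_params = rank * (d_in + d_out)
print("trainable parameters:", lora_params, "of", full_params)
```

Because B starts at zero, fine-tuning begins exactly at the pretrained model and only the two small factors are updated. Training fewer parameters is a common reason to choose this approach when experimental data are scarce.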
Abstract
Protein-protein interactions (PPIs) play a crucial role in numerous biological processes. Developing methods that predict binding affinity changes under substitution mutations is fundamental for modelling and re-engineering biological systems. Deep learning is increasingly recognized as a powerful tool capable of bridging the gap between in-silico predictions and in-vitro observations. With this contribution, we propose eGRAL, a novel SE(3) equivariant graph neural network (eGNN) architecture designed for predicting binding affinity changes from multiple amino acid substitutions in protein complexes. eGRAL operates at the residue, atomic and evolutionary scales, the last through features extracted from protein large language models. To address the limited availability of large-scale affinity assays with structural information, we generate a simulated dataset comprising approximately 500,000 data points. Our model is pre-trained on this dataset, then fine-tuned and tested on experimental data.
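To make the equivariance property concrete, here is a toy sketch of one message-passing step. It is not the eGRAL architecture. Messages depend only on scalar node features and squared distances, and coordinates are updated along relative position vectors, so rotating and translating the input rotates and translates the output in the same way. The function names, the fixed tanh message function (a stand-in for a learned network), and the four-node example graph are assumptions made for illustration.

```python
import math


def egnn_style_update(coords, feats):
    """One toy message-passing step on a fully connected graph.

    coords: list of [x, y, z] node positions
    feats:  list of scalar node features (stand-ins for learned embeddings)
    Returns (new_coords, new_feats).
    """
    n = len(coords)
    new_coords, new_feats = [], []
    for i in range(n):
        shift = [0.0, 0.0, 0.0]
        agg = 0.0
        for j in range(n):
            if i == j:
                continue
            diff = [coords[i][k] - coords[j][k] for k in range(3)]
            dist2 = sum(d * d for d in diff)  # unchanged by rotation and translation
            # Fixed stand-in for a learned message function of invariant inputs.
            msg = math.tanh(feats[i] * feats[j] - 0.1 * dist2)
            agg += msg
            for k in range(3):
                # Move along the relative vector, which rotates with the input.
                shift[k] += diff[k] * msg / (n - 1)
        new_coords.append([coords[i][k] + shift[k] for k in range(3)])
        new_feats.append(feats[i] + agg)
    return new_coords, new_feats


def rigid_motion(coords, angle, translation):
    """Rotate about the z-axis by `angle` radians, then translate."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c * x - s * y + translation[0],
             s * x + c * y + translation[1],
             z + translation[2]] for x, y, z in coords]


def max_abs_diff(a, b):
    """Largest absolute coordinate difference between two point lists."""
    return max(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb))


# Hypothetical four-node example.
coords = [[0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [0.0, 2.0, 0.5], [1.0, 1.0, -1.0]]
feats = [0.2, -0.4, 1.0, 0.7]
angle, translation = 0.9, [3.0, -1.0, 2.0]

# Path 1: update the original structure.
out_coords, out_feats = egnn_style_update(coords, feats)
# Path 2: move the structure first, then update.
moved_coords, moved_feats = egnn_style_update(
    rigid_motion(coords, angle, translation), feats)

# For an equivariant layer, moving the Path 1 output reproduces Path 2.
print("coordinate mismatch:",
      max_abs_diff(rigid_motion(out_coords, angle, translation), moved_coords))
```

The scalar features come out the same on both paths (invariance), while the coordinates match only after applying the same rigid motion (equivariance). A prediction such as ΔΔG, built from the invariant features, would then not depend on how the complex is oriented in space.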
