Learning to design protein-protein interactions with enhanced generalization

Anton Bushuiev; Roman Bushuiev; Petr Kouba; Anatolii Filkin; Marketa Gabrielova; Michal Gabriel; Jiri Sedlar; Tomas Pluskal; Jiri Damborsky; Stanislav Mazurenko; Josef Sivic

TL;DR

The enhanced generalization of the new PPIformer approach is demonstrated by outperforming other state-of-the-art methods on new, non-leaking splits of standard labeled PPI mutational data and independent case studies optimizing a human antibody against SARS-CoV-2 and increasing the thrombolytic activity of staphylokinase.

Abstract

Discovering mutations enhancing protein-protein interactions (PPIs) is critical for advancing biomedical research and developing improved therapeutics. While machine learning approaches have substantially advanced the field, they often struggle to generalize beyond training data in practical scenarios. The contributions of this work are three-fold. First, we construct PPIRef, the largest and non-redundant dataset of 3D protein-protein interactions, enabling effective large-scale learning. Second, we leverage the PPIRef dataset to pre-train PPIformer, a new SE(3)-equivariant model generalizing across diverse protein-binder variants. We fine-tune PPIformer to predict effects of mutations on protein-protein interactions via a thermodynamically motivated adjustment of the pre-training loss function. Finally, we demonstrate the enhanced generalization of our new PPIformer approach by outperforming other state-of-the-art methods on new, non-leaking splits of standard labeled PPI mutational data and independent case studies optimizing a human antibody against SARS-CoV-2 and increasing the thrombolytic activity of staphylokinase.

Paper Structure (55 sections, 6 equations, 7 figures, 7 tables, 2 algorithms)

Introduction
Related work
Predicting the effects of mutations on protein--protein interactions.
Self-supervised learning for protein design.
Datasets of protein--protein interactions.
PPIRef: New large dataset of protein--protein interactions
iDist: new efficient approach for protein--protein interface deduplication
Limitations of existing protein--protein interaction datasets
PPIRef: New large dataset of protein--protein interactions
Learning to design protein--protein interactions
Representation of protein--protein complexes
PPIformer model
3D equivariant self-supervised pre-training from unlabeled protein--protein interactions
Structural masking of protein--protein interfaces.
Loss for masked modeling of protein--protein interfaces.
...and 40 more sections

Figures (7)

Figure 1: Comparison of PPIRef with existing datasets of native protein complexes.
Figure 3: An example of protein--protein interfaces from different folds of the DIPS dataset detected as near duplicates by our iDist method. Both PPIs come from homooligomers of the same KatE enzyme carrying different single-point mutations. Notably, the symmetry of the complex itself yields 3 groups of 2 duplicates (from a single PDB entry with 6 PPIs). Furthermore, querying PDB for "KatE" returns 36 KatE complexes, which therefore yields 3 groups of 72 duplicates each. Our iDist approach can efficiently identify such structural near duplicates on a large scale.
Figure 4: Overview of PPIformer. A single pre-training step starts with randomly sampling a protein--protein interaction $\mathbf{c}$ (in this example, the staphylokinase dimer A--B from the PDB entry 1C78) from PPIRef. Next, randomly selected residues $M$ are masked to obtain the masked interaction $\mathbf{c}_{\setminus M}$. After that, the interaction is converted into a graph representation $(G, \mathbf{X}, \mathbf{E}, \mathbf{F}_0, \mathbf{F}_1)$ with masked nodes $M$ (black circles). The model then learns to classify the types of the masked amino acids by computing an $SE(3)$-invariant hidden representation $\mathbf{H}$ of the whole interface via the encoder $f$ and the classifier $g$ (red arrows). On the downstream task of $\Delta \Delta G$ prediction, the mutated amino acids are masked, and the probabilities of possible substitutions $\mathbf{P}_{M,:}$ are jointly inferred with the pre-trained model. Finally, the estimate $\widehat{\Delta \Delta G}$ is obtained from the predicted probabilities $p$ of the wild-type $c_i$ and the mutant $m_i$ amino acids via log odds (blue arrows).
Figure 5: Illustration of the single-step message passing of iDist (lines 5-12 in \ref{alg:embed}). A residue $i$ receives complementary distance-weighted messages $\mathbf{m}_{intra}$ and $\mathbf{m}_{inter}$ from all residues within the same protein $J_{intra}$ and other partners $J_{inter}$. The messages are aggregated into the embedding $\mathbf{h}_i$ and the procedure is repeated for each interface residue. iDist efficiently detects near-duplicate PPIs as the ones having similar averaged interface embeddings.
Figure 6: Benchmarking our efficient iDist approximation of the structural alignment algorithm iAlign. (I) Joint log-scale histogram displaying pair-wise iAlign (IS-score mode, $6\mathrm{\mathring{A}}$ cutoff) and iDist values of 1646 PPI interfaces (2,709,316 pairs) corresponding to 100 PDB codes sampled from DIPS \cite{townshend2019end}. The iAlign values vary between 0 and 1, with high values corresponding to well-aligned interfaces (1 for identical interfaces) and low values corresponding to poorly aligned interfaces. The iDist values range from 0 to infinity, with high values corresponding to structurally distant interfaces and low values corresponding to similar interfaces (0 for identical interfaces). Panels (III), (IV), and (V) depict samples from regions in (I) where the two methods agree, while (II) shows an example of a disagreement. Each panel displays two interfaces colored by amino acid types, with one protein's palette in reddish hues and the other in greenish hues. (II) Example of disagreement. The iAlign score corresponds to the expected value of the alignment of two random PPIs, while iDist suggests higher similarity due to the identity of several fragments of chains M and N (note the $\varepsilon$-like green loop and its further continuation) and similar compositions of amino acids in the helices belonging to proteins E and D (similar combination of reddish colors). In fact, the two interfaces represent different interaction modes of the same two chains in a large symmetric complex. (III) Unrelated interfaces. (IV) Interfaces on the edge of being considered near duplicates. The interactions are clearly related, but the geometry and primary structure differ at every local fragment. (V) Near duplicates. The proteins are visualized using PyMOL \cite{delano2002pymol}.
...and 2 more figures
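
The log-odds step described in the Figure 4 caption can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `ddg_log_odds`, the dictionary layout of the predicted probabilities, and the sign convention (wild type minus mutant, so that a mutant judged more likely than the wild type gives a negative estimate) are all assumptions made here for illustration. The sketch sums the per-position log odds over all mutated positions.

```python
import math


def ddg_log_odds(probs, mutations):
    """Hypothetical log-odds estimate of ddG from masked-residue probabilities.

    probs: dict mapping a masked position to a dict {amino acid: probability},
           standing in for the rows of P_{M,:} predicted by a pre-trained model.
    mutations: list of (position, wild_type, mutant) tuples.

    Assumed sign convention: log p(wild type) - log p(mutant), summed over
    the mutated positions.
    """
    total = 0.0
    for position, wild_type, mutant in mutations:
        p = probs[position]
        total += math.log(p[wild_type]) - math.log(p[mutant])
    return total


# Toy probabilities for two masked interface positions (made-up numbers).
toy_probs = {
    10: {"A": 0.6, "W": 0.2},
    42: {"K": 0.1, "E": 0.4},
}

single = ddg_log_odds(toy_probs, [(10, "A", "W")])
double = ddg_log_odds(toy_probs, [(10, "A", "W"), (42, "K", "E")])
print(single, double)
```

In this toy input the substitution at position 10 is judged less likely than the wild type and contributes a positive term, while the substitution at position 42 is judged more likely and contributes a negative term; the double mutant is the sum of the two.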
