Pairing interacting protein sequences using masked language modeling

Umberto Lupo; Damiano Sgarbossa; Anne-Florence Bitbol

TL;DR

This work introduces DiffPALM, a differentiable framework that uses the masked language modeling (MLM) signal of MSA Transformer to solve paralog matching: it searches for the within-species pairing of two MSAs that minimizes the MLM loss of the resulting paired MSA. By representing within-species pairings as permutation matrices and optimizing them via Sinkhorn-based differentiable surrogates, DiffPALM outperforms existing coevolution-based methods on shallow MSAs, and its performance improves further when known interacting pairs are provided. The method extends to challenging eukaryotic complexes, where it substantially improves AlphaFold-Multimer structure predictions for some complexes without significantly degrading the others tested, and it is competitive with orthology-based pairing. Overall, DiffPALM shows that protein language models trained on MSAs capture inter-chain coevolution signals, which enables more accurate partner pairing and improved complex structure prediction in data-limited regimes.
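To make the two operators mentioned above concrete, here is a minimal pure-Python sketch, not the authors' implementation. `sinkhorn` alternately normalizes the rows and columns of exp(X / tau) in log space, giving a nearly doubly stochastic "soft permutation"; `hard_matching` returns the permutation that maximizes the sum of selected entries of X. The function names, the temperature `tau`, the iteration count, and the brute-force search (practical only for tiny matrices; a real implementation would use a linear assignment solver) are all assumptions made for illustration.

```python
import itertools
import math


def _logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def sinkhorn(x, n_iter=20, tau=1.0):
    """Soft permutation: alternate row/column normalization of exp(x / tau).

    `tau` and `n_iter` are illustrative hyperparameters (assumptions).
    """
    n = len(x)
    log_p = [[v / tau for v in row] for row in x]
    for _ in range(n_iter):
        # Normalize each row so that its entries sum to 1.
        for i in range(n):
            lse = _logsumexp(log_p[i])
            log_p[i] = [v - lse for v in log_p[i]]
        # Normalize each column so that its entries sum to 1.
        for j in range(n):
            lse = _logsumexp([log_p[i][j] for i in range(n)])
            for i in range(n):
                log_p[i][j] -= lse
    return [[math.exp(v) for v in row] for row in log_p]


def hard_matching(x):
    """Permutation maximizing sum_i x[i][perm[i]] (brute force, tiny n only)."""
    n = len(x)
    return max(
        itertools.permutations(range(n)),
        key=lambda perm: sum(x[i][perm[i]] for i in range(n)),
    )
```

In a differentiable pairing scheme of this kind, the hard matching would define the pairing actually fed to the model, while the Sinkhorn output provides a smooth surrogate through which gradients can flow.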

Abstract

Predicting which proteins interact from their amino-acid sequences is an important task. We develop a method to pair interacting protein sequences which leverages the power of protein language models trained on multiple sequence alignments, such as MSA Transformer and the EvoFormer module of AlphaFold. We formulate the problem of pairing interacting partners among the paralogs of two protein families in a differentiable way. We introduce a method called DiffPALM that solves it by exploiting the ability of MSA Transformer to fill in masked amino acids in multiple sequence alignments using the surrounding context. MSA Transformer encodes coevolution between functionally or structurally coupled amino acids. We show that it captures inter-chain coevolution even though it was trained on single-chain data, which means that it can be used out-of-distribution. Relying on MSA Transformer without fine-tuning, DiffPALM outperforms existing coevolution-based pairing methods on difficult benchmarks of shallow multiple sequence alignments extracted from ubiquitous prokaryotic protein datasets. It also outperforms an alternative method based on a state-of-the-art protein language model trained on single sequences. Paired alignments of interacting protein sequences are a crucial ingredient of supervised deep learning methods to predict the three-dimensional structure of protein complexes. DiffPALM substantially improves the structure prediction of some eukaryotic protein complexes by AlphaFold-Multimer, without significantly deteriorating any of those we tested. It also achieves performance competitive with orthology-based pairing.
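As a rough illustration of the masked-language-modeling setup described in the abstract, the sketch below pairs two small alignments according to a candidate permutation and masks residues on one side only. It is a toy under stated assumptions, not the authors' code: the function name `pair_and_mask`, the mask symbol `#`, the choice to permute and mask the first alignment, and the use of a single species are all hypothetical. The MLM loss itself would come from asking a model such as MSA Transformer to predict the recorded targets, which is not done here.

```python
import random

MASK = "#"  # hypothetical mask symbol, chosen only for this illustration


def pair_and_mask(msa_a, msa_b, perm, p_mask=0.5, seed=0):
    """Pair row perm[i] of msa_a with row i of msa_b, masking the A side.

    Returns the masked paired rows and a list of (row, position, residue)
    targets that a masked language model would be asked to recover.
    """
    rng = random.Random(seed)
    masked_rows, targets = [], []
    for i, seq_b in enumerate(msa_b):
        seq_a = msa_a[perm[i]]
        left = list(seq_a)
        for pos, residue in enumerate(seq_a):
            # Mask each A-side residue independently with probability p_mask.
            if rng.random() < p_mask:
                left[pos] = MASK
                targets.append((i, pos, residue))
        # Concatenate the (masked) A sequence with its candidate partner in B.
        masked_rows.append("".join(left) + seq_b)
    return masked_rows, targets
```

A correct pairing should make the masked residues easier to predict from the partner sequences than a shuffled pairing does, which is the signal a pairing method of this type would try to exploit.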

Paper Structure (18 sections, 3 equations, 14 figures, 2 tables)

Goal and notations.
Dealing with asymmetric cases.
Formalization.
Construction of an appropriate MLM loss.
Noise and regularization.
Optimization.
Exploring the loss landscape through multiple initializations.
Result and confidence.
Improving precision: MRA and IPA.
Pairing methods employed in AFM and ColabFold.
Pairing using DiffPALM.
Supplementary methods
MSA Transformer and masked language modeling for MSAs
A differentiable formulation of paralog matching
Datasets
...and 3 more sections

Figures (14)

Figure 1: Performance of DiffPALM on small HK-RR MSAs. The performance of two variants of DiffPALM (MRA and IPA, see "\ref{subsec:MRA-IPA}") is shown versus the number of runs used for the MRA variant, for $40$ MSAs comprising about 50 HK-RR pairs. The chance expectation, and the performance of various other methods, are reported as baselines. Three existing coevolution-based methods are considered: DCA-IPA Bitbol16, MI-IPA Bitbol18, and GA-IPA Gandarilla23. We also consider a pairing method based on the scores given by the ESM-2 (650M) single-sequence protein language model Lin2022, see "\ref{subsec:methods_esm2}". With all methods, a full one-to-one within-species pairing is produced, and performance is measured by precision (also called positive predictive value or PPV), namely, the fraction of correct pairs among predicted pairs. The default score is "precision-100", where this fraction is computed over all predicted pairs (100% of them). For DiffPALM-MRA, we also report "precision-10", which is calculated over the top $10 \%$ predicted pairs, when ranked by predicted confidence within each MSA (see "\ref{sec:methods}"). For DiffPALM, we plot the mean performance on all MSAs (color shading), and the standard error range (shaded region). For our ESM-2 based method, we consider 10 different values of masking probability $p$ from $0.1$ to $1.0$, and we report the range of precisions obtained (gray shading). For other baselines, we report the mean performance on all MSAs.
Figure 2: Impact of positive examples and extension to another pair of protein families. We report the performance of DiffPALM with 5 MRA runs (measured as precision-100 and precision-10, see Figure 1), for various numbers of positive examples, on the same HK-RR MSAs as in Figure 1 (left panel). We also report the performance of DiffPALM for similarly-sized MALG-MALK MSAs (right panel). In both cases, we show the mean value over the 40 different MSAs with its standard error interval, and we plot the chance expectation for reference.
Figure 3: Performance of AFM using different pairing methods. We use AFM to predict the structure of protein complexes starting from differently paired MSAs, each of them constructed from the same initial unpaired MSAs. Three pairing methods are considered: the default one of AFM, only pairing orthologs to the two query sequences, and a single run of DiffPALM (equivalent to one MRA run). Performance is evaluated using DockQ scores (top panels), a widely used measure of quality for protein-protein docking Basu16, and the AFM confidence scores (bottom panels), see "\ref{subsec:generalities_AFM}". The latter are also used as transparency levels in the top panels, where more transparent markers denote predicted structures with low AFM confidence. For each query complex, AFM is run five times. Each run yields 25 predictions which are ranked by AFM confidence score. The five top predicted structures are selected from each run, giving 25 predicted structures in total for each complex. Out of the 15 complexes listed in \ref{tab:dataset_pdb}, we restrict to those where any two of these three pairing methods yield a significant difference ($>0.1$) in average DockQ scores for at least one set of predictions coming from different runs but with the same within-run rank according to AFM confidence. Panels are ordered by increasing mean DockQ score for the AFM default method.
Figure 4: Schematic of the DiffPALM method. First, the parameterization matrices $X_k$ are initialized, and then the following steps are repeated until the loss converges: (1) Compute the permutation matrix $M(X_k)$ and use it to shuffle $\mathcal{M}^{(\mathrm{A})}$ relative to $\mathcal{M}^{(\mathrm{B})}$. Then pair the two MSAs. (2) Randomly mask some tokens of one of the two sides of the paired MSA and compute the MLM loss (\ref{eq:MLM_loss}). (3) Backpropagate the loss and update the parameterization matrices $X_k$, using the Sinkhorn operator $\hat{S}$ for the backward step instead of the matching operator $M$ (see "\ref{supp_meth:diff_matching}").
Figure S1: Comparison of contact maps predicted by MSA Transformer for the correct pairing of an HK MSA and an RR MSA ("Correct pairs"), and for an incorrect pairing ("Shuffled pairs"). We observe that MSA Transformer is able to correctly predict the inter-protein contacts when given as input a paired MSA made of correctly matched sequences. Conversely, if the model is given as input a paired MSA where rows have been shuffled before pairing, it is not able to recover the inter-protein contact map (even though it correctly recovers the intra-protein contact maps). These results suggest that MSA Transformer can distinguish between interacting and non-interacting pairs of protein sequences, despite the fact that dimers or paired MSAs were not in the training set used for its MLM pre-training rao2021msa.
...and 9 more figures
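
The precision-100 and precision-10 scores defined in the caption of Figure 1 can be sketched as follows. This is an illustrative helper, not code from the paper: the function name `precision`, its signature, and the rounding rule used to decide how many pairs fall in the top fraction are assumptions.

```python
def precision(predicted_pairs, true_pairs, confidences=None, top_fraction=1.0):
    """Fraction of correct pairs among the retained predicted pairs.

    top_fraction=1.0 gives "precision-100"; top_fraction=0.1 with per-pair
    confidences gives "precision-10" for a single MSA.
    """
    pairs = list(predicted_pairs)
    if not pairs:
        raise ValueError("no predicted pairs")
    if top_fraction < 1.0:
        if confidences is None:
            raise ValueError("confidences are needed to rank predicted pairs")
        # Rank predicted pairs by decreasing confidence.
        order = sorted(range(len(pairs)), key=lambda i: confidences[i], reverse=True)
        # Assumed rounding rule: keep at least one pair.
        n_keep = max(1, round(top_fraction * len(pairs)))
        pairs = [pairs[i] for i in order[:n_keep]]
    truth = set(true_pairs)
    return sum(pair in truth for pair in pairs) / len(pairs)
```

For example, with ten predicted pairs of which six are correct, `precision(...)` returns 0.6, while `top_fraction=0.1` scores only the single most confident pair.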
