Out-of-equilibrium selection pressure enhances inference from protein sequence data

Nicola Dietler; Cyril Malbranke; Anne-Florence Bitbol

TL;DR

Out-of-equilibrium noise arising from ubiquitous variations in natural selection enhances, rather than hinders, the success of inference from protein sequences. In particular, it improves coevolution-based inference of structural contacts.

Abstract

Homologous proteins have similar three-dimensional structures and biological functions that shape their sequences. The resulting coevolution-driven correlations underlie methods from Potts models to AlphaFold, which infer protein structure and function from sequences. Using a minimal model, we show that fluctuating selection strength and the onset of new selection pressures improve coevolution-based inference of structural contacts. Our conclusions extend to realistic synthetic data and to the inference of interaction partners. Out-of-equilibrium noise arising from ubiquitous variations in natural selection thus enhances, rather than hinders, the success of inference from protein sequences.

Paper Structure (8 sections, 2 equations, 15 figures)


Introduction.
Model and methods.
Fluctuating selection enhances contact inference.
Extension to realistic synthetic data.
Impact of selection onset with phylogeny.
Extension to interaction partner inference.
Discussion.
Code accessibility.

Figures (15)

Figure 1: Impact of fluctuating selection strength on contact prediction. TP fraction versus number of accepted mutations per sequence, while switching selection strength (i.e. the sampling temperature) via a telegraph process with timescale $\tau$. (a) For each value of $\tau$, the TP fraction is averaged over 1000 replicates. In each replicate, the same telegraph process is used for all sequences in the MSA. Different replicates use different realizations of the data generation and of the telegraph process. (b) The TP fraction is shown for single realizations of the telegraph process and data generation. Gray background: $T_{2}$; white background: $T_{1}$. (c) Same as in (a), except that a different telegraph process is used for each sequence in an MSA. In all panels, TP fractions for equilibrium sequences generated at $T_{1}$ and $T_{2}$ are shown for reference. Sequences are generated using our minimal model, see Fig. S1(a-b), with $T_{1} = 1$ and $T_{2} = 15$. For each realization, we generate an MSA of 2048 sequences of length 200 using an Erdős-Rényi random graph with probability 0.02 to represent contacts, and infer contacts via mean-field DCA (mfDCA) [Marks11, Morcos11], which is computationally efficient and performs well on minimal data [Dietler2023], with pseudocount $0.01$.
Figure 2: Impact of fluctuating selection strength on contact prediction: PF00004 family. As in Fig. 1(b), the TP fraction is shown versus the number of accepted mutations under switching selection strength via a telegraph process, but for realistic data composed of MSAs of 70,000 sequences generated from a Potts model inferred on a natural MSA of 39,277 sequences from the PF00004 family (with length 132), following Refs. [Lupo22, Figliuzzi18], in particular using regularization strengths of 0.01. Gray background: $T_{2}$; white background: $T_{1}$ (alternations not shown in the first panel for readability). Inference performances for equilibrium sequences generated at $T_{1}$, $T_{2}$, and $T=1$ are shown for reference. Contact inference is performed using plmDCA [Ekeberg13, Ekeberg14], with regularization strengths set to 0.01 and no phylogenetic reweighting.
Figure 3: Impact of selection onset on contact inference with phylogeny. The TP fraction is shown versus the number $\mu$ of accepted mutations per branch of the star phylogenetic tree, starting from a random ancestral sequence. Sequences with length 200 are generated as explained in Fig. S1(c), using the same Erdős-Rényi graph as in Fig. 1, along a star phylogenetic tree with 2048 branches. Along each branch, Monte Carlo sampling is performed until $\mu$ mutations are accepted, giving 2048 final sequences, which constitute our MSA. We infer contacts via mfDCA [Marks11, Morcos11, Dietler2023] with pseudocount 0.01. Results are averaged over 100 replicates, each starting from a different random ancestor.
Figure 4: Impact of fluctuating selection strength on interaction partner prediction. TP fraction versus number of accepted mutations, while switching selection strength (i.e. the sampling temperature) via a telegraph process with timescale $\tau$. For each value of $\tau$, the TP fraction is averaged over 1000 replicates. TP fractions for equilibrium sequences generated at $T_{1}$ and $T_{2}$ are shown for reference. MSAs comprising 1024 sequences are generated as in Fig. 1(a), and then randomly split into a training set of 400 sequences and a testing set of 624 sequences. The latter is randomly divided into sets of 4 sequences each, representing species, and each sequence is split into two halves of equal length representing two interaction partners. Pairings between partners are blinded in each species, and predicted using mfDCA scores with pseudocount $0.01$ [Gerardos22].
Figure S1: Minimal model. (a) We model structural contacts as couplings set to 1 on the edges of an Erdős-Rényi random graph. Couplings between other nodes are set to 0. Equilibrium sampling of independent sequences is performed using a Metropolis-Hastings algorithm under the Hamiltonian in Eq. (1), with spin flip acceptance probability in Eq. (2), starting from random initial sequences. The MSA formed from the resulting sequences features correlations arising from couplings, shown between columns $0$ and $2$ (blue). (b) To generate sequences with fluctuating selection strength, we start from sequences generated at equilibrium with temperature $T_{1}$. We then evolve each of these sequences using the same Metropolis-Hastings algorithm as before, but with a sampling temperature $T$ that switches between $T_{1}$ and $T_{2}$. (c) To minimally model the impact of emerging selection during evolution along a phylogeny, we take a random sequence as ancestor, and we evolve it along a star phylogenetic tree (green) where mutations (red) are accepted with the probability in Eq. (2). This represents selection for structural contacts being switched on. On each branch of the tree, $\mu$ mutations are accepted, with $\mu=2$ in our schematic.
...and 10 more figures
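
The data generation described in the captions of Fig. 1 and Fig. S1(a-b) can be illustrated with a minimal sketch. It is not the authors' code, and it rests on several assumptions that this page does not confirm: Eq. (1) and Eq. (2) are not reproduced here, so the sketch assumes $\pm 1$ spins with Hamiltonian $H = -\sum_{(i,j)} \sigma_i \sigma_j$ over graph edges and the standard Metropolis acceptance probability $\min(1, e^{-\Delta H / T})$; the telegraph process is assumed to be discrete, switching with probability $1/\tau$ after each accepted mutation; and a finite burn-in at $T_1$ stands in for equilibrium sampling. All function names are hypothetical.

```python
import numpy as np


def erdos_renyi_couplings(n_sites, p, rng):
    """Symmetric 0/1 coupling matrix: each pair i < j is a contact with probability p."""
    upper = np.triu(rng.random((n_sites, n_sites)) < p, k=1)
    return (upper | upper.T).astype(float)


def telegraph_temperatures(n_steps, tau, t1, t2, rng):
    """One temperature per accepted mutation (requires n_steps >= 1).

    Assumption: the state flips with probability 1/tau at each step,
    and the process starts at t1.
    """
    flips = rng.random(n_steps) < 1.0 / tau
    flips[0] = False  # no switch before the first mutation
    state = np.cumsum(flips) % 2
    return np.where(state == 0, t1, t2)


def evolve(seq, couplings, temperature, n_accepted, rng):
    """Propose single spin flips until n_accepted of them are accepted (in place)."""
    accepted = 0
    while accepted < n_accepted:
        site = rng.integers(seq.size)
        # Energy change of flipping `site` under the assumed Hamiltonian
        # H = -sum over contacts of s_i * s_j.
        delta_h = 2.0 * seq[site] * (couplings[site] @ seq)
        if delta_h <= 0 or rng.random() < np.exp(-delta_h / temperature):
            seq[site] = -seq[site]
            accepted += 1
    return seq


def simulate_msa(n_seqs=16, n_sites=50, p=0.02, t1=1.0, t2=15.0,
                 tau=20.0, n_burn_in=200, n_steps=100, seed=0):
    """Generate an MSA under fluctuating selection strength.

    Sizes are far smaller than the 2048 x 200 used in Fig. 1, to keep the
    sketch fast.
    """
    rng = np.random.default_rng(seed)
    couplings = erdos_renyi_couplings(n_sites, p, rng)
    # A single telegraph realization shared by all sequences, as in Fig. 1(a).
    temps = telegraph_temperatures(n_steps, tau, t1, t2, rng)
    msa = rng.choice([-1, 1], size=(n_seqs, n_sites))
    for seq in msa:  # each row is a view, so updates happen in place
        evolve(seq, couplings, t1, n_burn_in, rng)  # burn-in at T1
        for temperature in temps:
            evolve(seq, couplings, temperature, 1, rng)
    return msa, couplings, temps
```

Calling `simulate_msa()` returns the alignment, the contact matrix, and the temperature schedule. Drawing a separate schedule inside the loop over sequences would correspond to the setting of Fig. 1(c), where each sequence follows its own telegraph process.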
