Functional bottlenecks can emerge from non-epistatic underlying traits

Anna Ottavia Schulte, Samar Alqatari, Saverio Rossi, Francesco Zamponi

TL;DR

Protein fitness landscapes exhibit epistasis, and it remains debated whether functional bottlenecks require network epistasis or can arise under global epistasis alone. We propose a stylized model in which fitness is a nonlinear function of an additive trait $E(\mathbf{a})=\sum_i h_i a_i$. Two phenotypes, blue and red, are defined by thresholds on $E$, and two fitness mappings, $F_B(E)$ and $F_R(E)$, drive the evolutionary transitions. After calibrating the model to empirical data and exploring different distributions of single mutational effects (SMEs), we show that bottleneck topologies arise with high probability when nearly neutral and strongly non-neutral mutations are mixed, even in the absence of higher-order interactions. The calibrated ensemble exhibits a mid-path jumper genotype through which all viable paths must pass, and the number of viable paths grows exponentially with mutational distance, implying sustained evolutionary accessibility. Overall, the work identifies mutational-effect heterogeneity as a key determinant of fitness-landscape topology and demonstrates that functional bottlenecks can emerge from global epistasis alone.

Abstract

Protein fitness landscapes frequently exhibit epistasis, where the effect of a mutation depends on the genetic context in which it occurs, i.e., the rest of the protein sequence. Epistasis increases landscape complexity, often resulting in multiple fitness peaks. In its simplest form, known as global epistasis, fitness is modeled as a non-linear function of an underlying additive trait. In contrast, more complex epistasis arises from a network of (pairwise or many-body) interactions between residues, which cannot be removed by a single non-linear transformation. Recent studies have explored how global and network epistasis contribute to the emergence of functional bottlenecks, i.e., fitness landscape topologies in which two broad high-fitness basins, representing distinct phenotypes, are separated by a bottleneck that can only be crossed via one or a few mutational paths. Here, we introduce and analyze a stylized model of global epistasis with an additive underlying trait. We demonstrate that functional bottlenecks arise with high probability if the model is properly calibrated. Furthermore, our results underscore that a proper balance between neutral and non-neutral mutations is needed for the emergence of functional bottlenecks.
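As a rough illustration of global epistasis with an additive trait, the sketch below computes $E(\mathbf{a})=\sum_i h_i a_i$ and passes it through a sigmoid. Several choices here are assumptions made for illustration and are not taken from the paper: the binary encoding $a_i\in\{0,1\}$, the `steepness` parameter, the specific sigmoid form (the paper uses sigmoids only for illustration), and the function names.

```python
import math

def additive_trait(h, a):
    """Additive trait E(a) = sum_i h_i * a_i (a_i in {0, 1} is an assumption)."""
    return sum(h_i * a_i for h_i, a_i in zip(h, a))

def sigmoid_fitness(e, sign=1.0, steepness=4.0):
    """Illustrative nonlinear map from trait to fitness.

    sign=+1 favors large positive E (blue-like mapping),
    sign=-1 favors large negative E (red-like mapping).
    The steepness value is a hypothetical choice.
    """
    return 1.0 / (1.0 + math.exp(-sign * steepness * e))

def phenotype(e, e_c):
    """Classify a genotype by thresholds on E: blue above +e_c, red below -e_c."""
    if e > e_c:
        return "blue"
    if e < -e_c:
        return "red"
    return "non-functional"

h = [1.0, -2.0, 0.5]   # hypothetical single mutational effects
a = [1, 0, 1]          # hypothetical genotype
e = additive_trait(h, a)
print(e, phenotype(e, e_c=0.53), sigmoid_fitness(e, sign=1.0))
```

Because the nonlinearity acts only on the scalar $E$, every mutation contributes additively to the trait; any context dependence in fitness comes from the shape of the map alone.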

Paper Structure

This paper contains 23 sections, 11 equations, 13 figures.

Figures (13)

  • Figure 1: Experimental data from Ref. \cite{poelwijk2019learning}. (A) Schematic shape of the fitness functions for the red and blue phenotypes as a function of the underlying trait $E$. Sigmoid functions have been used for illustration. The two reference structures with the corresponding mutations are also shown, from Ref. \cite{poelwijk2019learning}. (B) Histogram of the value of $E$ obtained from Eqs. (\ref{eq:EvsF}, \ref{eq:EAEB}) for each of the $2^{13}$ variants. The red and blue lines correspond respectively to the reference values $E^{\rm ref}_R$ and $E^{\rm ref}_B$. (C) Histogram of the absolute value $|\Delta E|$ of single mutational effects (SMEs) from Eq. \ref{eq:SME} for all $13\times 2^{13}$ single mutations that can be obtained from the dataset, shown in log-log (main panel) and lin-log (inset) scales. The black curve is the fit obtained with the Pareto distribution in Eq. \ref{eq:Pareto}. (D) Topology of the space of paths obtained by keeping only genotypes ${\bf a}$ with $|E({\bf a})|>E_C$, and finding the largest possible value of $E_C=0.53$ (dashed line) such that the red and blue reference sequences remain connected. The lower panel reports the values of $E({\bf a})$ for each functional genotype (blue dots for $E>E_C$ and red dots for $E<-E_C$) as a function of the number of mutations $m$ from the blue reference, with the gray lines connecting pairs of genotypes that differ by a single mutation. The upper panel shows the resulting graph of connections.
  • Figure 2: Same representation as in Fig. \ref{fig:1}(D), using an instance of the calibrated model with Gaussian $P(h)$ instead of the experimental data. The lower panel reports the value of $E/E_T$ for each functional variant as a function of the distance from the blue reference variant (here with $E^{\rm ref}_B/E_T\approx 1.50$ and $E^{\rm ref}_R/E_T\approx-1.79$). The threshold value for which at least one mutational path remains is $E_C/E_T\approx0.72$ (dashed lines); when normalized with respect to the largest of the two reference genotypes, it reads $E_C/E^{\rm ref}_{\max}\approx0.40$.
  • Figure 3: Calibration of the two models with different $P(h)$, a Gaussian distribution or a Pareto distribution with cutoff. The black symbols correspond to the choice of $p$ after calibration. (A) Value of $E_T$ for which $\langle M \rangle \approx 8$ as a function of $p$. Dashed lines are linear fits. (B) Average value $\langle E^{\rm {ref}}_B/E_T \rangle$ as a function of $p$. (C) Average of $E_C/E^{\rm {ref}}_{\max}$ as a function of $p$. (D) Probability of having a single jumper at $E_C$, or equivalently that $E_C < E^{\rm {ref}}_{\min}$, as a function of $p$.
  • Figure 4: Distribution of some relevant quantities for two different calibrated models, the Gaussian model and the Pareto cutoff model. In the former, $L=500$ SMEs are extracted from a Gaussian distribution with unit variance and the tuning procedure has $p=0.26$, $E_T=2.0$; in the latter, $L=500$ SMEs are extracted from a distribution with Pareto tails decaying with $\alpha=0.7$ and a cutoff $|h|<2.0$, and the tuning procedure has $p=0.25$, $E_T=1.1$. In each plot the dashed and dot-dashed vertical lines represent the mean and the median of the data, respectively. (A) Distribution of the number of mutations. (B) Distribution of the value of the positive (blue) reference phenotype divided by $E_T$. (C) Distribution of the number of greedy steps needed to reach the target value $E_T$ when generating the two reference variants. (D) Distribution of SMEs $\tilde{h}_i$ selected by the tuning procedure to generate the two reference variants.
  • Figure 5: Statistical properties of the space of paths for two different calibrated models, the Gaussian model (A-C) and the Pareto cutoff model (D-F), with the same parameters as in Fig. \ref{fig:4}. (A), (D) Distribution of values of $E_C/E^{\rm ref}_{\rm max}$, separately for several values of $M$. The mean and the median (dashed and dot-dashed vertical lines, respectively) are computed with respect to the full distribution (with all $M$ values). (B), (E) Probability distribution of the jumper position $j$ along the path, for different values of $M$. (C), (F) Average of the logarithm of the number of paths that remain viable at $E_C$, as a function of $M$. (G) Some examples of topologies, each associated with a schematic representation used in panel (B).
  • ...and 8 more figures
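
The threshold $E_C$ described in the captions of Figures 1(D) and 2 (the largest value such that the two reference genotypes stay connected through single mutations when only genotypes with $|E({\bf a})|>E_C$ are kept) can be computed as a widest-path problem on the hypercube of genotypes. The sketch below is one possible way to do this and is not the authors' code. It assumes that genotypes are encoded as bitmasks over the $M$ sites that differ between the references, that the trait is `e_blue_ref` plus the sum of hypothetical per-site `effects`, and it returns the limiting value, i.e. the smallest $|E|$ along the best path.

```python
import heapq

def critical_threshold(e_blue_ref, effects):
    """Max over single-mutation paths of the min |E| along the path.

    Genotypes are integers whose bit i says whether mutation i is present;
    0 is the blue reference and (1 << M) - 1 is the red reference.
    All names and the encoding are illustrative assumptions.
    """
    m = len(effects)
    start, goal = 0, (1 << m) - 1

    def abs_e(node):
        # |E| of a genotype under the additive trait
        return abs(e_blue_ref + sum(effects[i] for i in range(m) if node >> i & 1))

    # Widest-path variant of Dijkstra: always expand the node whose best
    # known path has the largest bottleneck value (max-heap via negation).
    best = {start: abs_e(start)}
    heap = [(-best[start], start)]
    while heap:
        neg_width, node = heapq.heappop(heap)
        width = -neg_width
        if node == goal:
            return width
        if width < best.get(node, -1.0):
            continue  # stale heap entry
        for i in range(m):
            nxt = node ^ (1 << i)          # flip one site
            w = min(width, abs_e(nxt))     # bottleneck of the extended path
            if w > best.get(nxt, -1.0):
                best[nxt] = w
                heapq.heappush(heap, (-w, nxt))
    return None  # not reached: the hypercube is connected

# Three equal steps from E = 1.5 to E = -1.5: every path crosses |E| = 0.5.
print(critical_threshold(1.5, [-1.0, -1.0, -1.0]))
```

The search visits up to $2^M$ genotypes, so this brute-force form is only practical for moderate $M$, such as the 13 sites of the experimental dataset or the $\langle M \rangle \approx 8$ of the calibrated models.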