Modeling Protein Evolution via Generative Inference From Monte Carlo Chains to Population Genetics

Leonardo Di Bari; Thierry Mora; Andrea Pagnani; Aleksandra M. Walczak; Francesco Zamponi; Saverio Rossi

TL;DR

This study benchmarks three generative-evolution schemes on fitness landscapes inferred from Direct Coupling Analysis across four in vitro experiments. By comparing standard MCMC, tree-based MCMC, and population-genetics dynamics, the authors reveal that incorporating phylogenetic structure and finite-population effects is essential to reproduce non-equilibrium evolutionary trajectories, mutational spectra, and lineage correlations. While MCMC can capture equilibrium statistics, PopGen provides realistic selective sweeps and emergent phylogeny at the cost of long-time equilibrium sampling, with treeMCMC offering a middle ground. The work highlights the need to consider phylogenetic context and population dynamics when forecasting evolutionary paths and underscores the potential of these approaches for guiding protein engineering and extrapolating beyond current experimental data.

Abstract

Generative models derived from large protein sequence alignments define complex fitness landscapes, but their utility for accurately modeling non-equilibrium evolutionary dynamics remains unclear. In this work, we perform a rigorous comparative analysis of three simulation schemes, designed to mimic evolution in silico by local sampling of the probability distribution defined by a generative model. We compare standard independent Markov Chain Monte Carlo, Monte Carlo on a phylogenetic tree, and population genetics dynamics, benchmarking their outputs against deep sequencing data from four distinct in vitro evolution experiments. We find that standard Monte Carlo fails to reproduce the correct phylogenetic structure and generates unrealistic, gradual mutational sweeps. Performing Monte Carlo on a tree inferred from data improves phylogenetic fidelity and historical accuracy. The population genetics scheme successfully captures phylogenetic correlations, mutational abundances, and selective sweeps as emergent properties, without the need to infer additional information from data. However, this scheme comes at the price of not sampling the proper generative model distribution at long times. Our findings highlight the crucial role of phylogenetic correlations and finite-population effects in shaping evolutionary trajectories on fitness landscapes. These models therefore provide powerful tools for predicting complex adaptive paths and for reliably extrapolating evolutionary dynamics beyond current experimental limitations.
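To make the contrast between independent Monte Carlo chains and population genetics dynamics concrete, here is a minimal sketch on a toy Potts-type landscape. It is not the authors' implementation. Everything specific is an assumption made for illustration: the sizes `L` and `q`, the random parameters `h` and `J` (standing in for DCA-inferred ones), the inverse temperature `beta`, the mutation rate `mu`, and the Boltzmann form of the selection weights. The sketch works directly on an abstract alphabet and ignores codon accessibility, and the tree-based scheme is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
L, q = 8, 4  # toy sequence length and alphabet size (hypothetical values)

# Random Potts parameters standing in for a DCA-inferred model (not fitted to data).
h = rng.normal(size=(L, q))
J = rng.normal(scale=0.1, size=(L, L, q, q))
J = (J + J.transpose(1, 0, 3, 2)) / 2      # enforce J[i,j,a,b] == J[j,i,b,a]
J[np.arange(L), np.arange(L)] = 0.0        # no self-couplings


def energy(seq):
    """Potts energy E = -sum_i h_i(a_i) - sum_{i<j} J_ij(a_i, a_j)."""
    idx = np.arange(L)
    fields = h[idx, seq].sum()
    pairs = J[idx[:, None], idx[None, :], seq[:, None], seq[None, :]].sum() / 2
    return -(fields + pairs)


def mcmc_step(seq, beta=1.0):
    """One Metropolis step: single-site proposal, accepted with prob. min(1, exp(-beta*dE))."""
    new = seq.copy()
    new[rng.integers(L)] = rng.integers(q)
    dE = energy(new) - energy(seq)
    return new if rng.random() < np.exp(min(0.0, -beta * dE)) else seq


def popgen_round(pop, mu=0.05, beta=1.0):
    """One round of mutagenesis, selection and amplification at fixed population size."""
    pop = pop.copy()
    mutate = rng.random(pop.shape) < mu                # independent per-site mutation events
    pop[mutate] = rng.integers(q, size=mutate.sum())   # a redraw may return the same letter
    E = np.array([energy(s) for s in pop])
    w = np.exp(-beta * (E - E.min()))                  # assumed Boltzmann selection weights
    parents = rng.choice(len(pop), size=len(pop), p=w / w.sum())
    return pop[parents]                                # offspring share ancestors


wt = rng.integers(q, size=L)                           # toy "wild type"

# Independent chains: every sequence evolves alone from the wild type.
chains = np.array([wt.copy() for _ in range(50)])
for _ in range(20):
    chains = np.array([mcmc_step(s) for s in chains])

# Population dynamics: sequences compete and are resampled each round.
pop = np.tile(wt, (50, 1))
for _ in range(20):
    pop = popgen_round(pop)
```

In the chain version no two sequences share history after the starting point, whereas in the population version resampling makes lineages coalesce, which is the mechanism by which phylogenetic correlations can arise without being imposed.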

Paper Structure (19 sections, 4 equations, 11 figures, 1 table)

Introduction
Materials and Methods
Experimental Datasets
DCA-derived fitness landscape
Codon accessibility
Simulation schemes
Calibration
Results
Divergence & Selection
Diversity & Phylogeny
Mutational Spectra
Mutational Dynamics
Long-time behavior and approach to equilibrium
Discussion
Detailed description of the MCMC simulation scheme
...and 4 more sections

Figures (11)

Figure 1: Schematic of the experimental and simulation pipeline. In vitro evolution data are used to benchmark three evolutionary simulation schemes (MCMC, treeMCMC, PopGen) over four different evolution experiments realized through rounds of mutagenesis, selection and amplification. Simulated and experimental sequences are compared through divergence, diversity, phylogeny, and mutational spectra.
Figure 2: Comprehensive mutational and fitness landscape of TEM1. (a) Nucleotide accessibility map for TEM1. The heatmap shows the average nucleotide distance from each wild-type codon to all possible codons corresponding to a given amino acid substitution. (b) Correlation between experimental variant frequencies in the evolution experiment of Ref. [fantini2020protein] (round 12) and deep mutational scanning (DMS) fitness measurements ($\Delta F$). Each data point corresponds to a single amino acid mutation. Data points are colored according to their average nucleotide distance from the wild-type sequence. The $\Delta F$ values were obtained from DMS studies of TEM1 variants [GONZALEZ2019fitness], with the vertical dashed line at $\Delta F = 0$ demarcating the boundary between beneficial and deleterious mutations.
Figure 3: Tuning of simulation schemes for the PSE1 experiment. The distribution of (a) fractional Hamming distance $H/L$ from the wild-type and (b) DCA energy $E$ for the experimental PSE1 population compared to the three simulation schemes after parameter tuning.
Figure 4: Comparative Phylogenetic Statistics. Distributions of evolutionary metrics for the PSE1, TEM1, AAC6, and mDHFR experiments against simulation results. The top row (a) displays the normalized pairwise Hamming distances between sequences, and the bottom row (b) illustrates the cumulative density function of terminal branch lengths derived from the inferred phylogenetic trees. Neutral (grey) curves correspond to a PopGen simulation without the selection and amplification steps.
Figure 5: Site Frequency Spectra (SFS). The frequency distribution of mutations is compared across the PSE1, TEM1, AAC6, and mDHFR experiments for all simulation schemes. The neutral (grey) simulation corresponds to a PopGen setup without a selection step. Curves are averaged over $10$ different realizations of each dynamics.
...and 6 more figures
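
Several quantities named in the captions above have simple operational definitions. The following is a hedged sketch of three of them: the codon accessibility of Figure 2(a), the fractional Hamming distance $H/L$ of Figure 3(a), and the mutation frequencies underlying the site frequency spectrum of Figure 5. The function names, the integer encoding of sequences, and the choice to average over all codons of the target amino acid (including the wild-type codon itself when it encodes that amino acid) are assumptions made for illustration. The paper's exact conventions may differ.

```python
from itertools import product

import numpy as np

# Standard genetic code, codons enumerated in TCAG order ('*' marks stop codons).
BASES = "TCAG"
AAS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AAS)}


def codon_accessibility(wt_codon, target_aa):
    """Mean nucleotide distance from wt_codon to every codon encoding target_aa."""
    dists = [
        sum(a != b for a, b in zip(wt_codon, codon))
        for codon, aa in CODON_TABLE.items()
        if aa == target_aa
    ]
    return sum(dists) / len(dists)


def fractional_hamming(seqs, wt):
    """H/L: fraction of positions at which each sequence differs from the wild type."""
    return (np.asarray(seqs) != np.asarray(wt)).mean(axis=1)


def mutation_frequencies(seqs, wt):
    """Population frequency of every observed (site, non-wild-type allele) pair."""
    seqs, wt = np.asarray(seqs), np.asarray(wt)
    freqs = []
    for i in range(seqs.shape[1]):
        alleles, counts = np.unique(seqs[:, i], return_counts=True)
        freqs.extend(c / len(seqs) for a, c in zip(alleles, counts) if a != wt[i])
    return np.array(freqs)


# Toy population of three integer-encoded sequences (hypothetical data).
wt = [0, 0, 0]
seqs = [[0, 0, 1], [0, 2, 1], [0, 2, 0]]
divergence = fractional_hamming(seqs, wt)
freqs = mutation_frequencies(seqs, wt)
# A site frequency spectrum is then a histogram of these frequencies.
sfs, edges = np.histogram(freqs, bins=10, range=(0.0, 1.0))
```

For example, `codon_accessibility("ATG", "I")` averages over the three isoleucine codons ATT, ATC and ATA, each one nucleotide away from ATG.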
