Predicting sub-population specific viral evolution
Wenxian Shi, Menghua Wu, Regina Barzilay
TL;DR
The paper introduces a sub-population-aware framework for predicting time-varying distributions of viral protein sequences. It learns transmission rates between locations through a transmission matrix A(x; θ) parameterized by a neural network. The dynamics follow a linear ODE whose closed-form solution is obtained by eigen-decomposition, and sequence probabilities are computed autoregressively, conditioning each amino acid on its prefix. A hierarchical variant reduces computational cost by block-diagonalizing the transmission matrix and using group-based approximations (G2L, G2G), while sharing information across locations supports data-sparse sub-populations. On SARS-CoV-2 and influenza A/H3N2, the model outperforms baselines, its learned transmission pathways agree with phylogenetic analyses, and its predictions achieve high coverage of future circulating sequences. The work has practical implications for location-specific vaccine design and surveillance; its limitations include the need for sub-population annotations and the assumption of time-stable, linear dynamics.
Abstract
Forecasting the change in the distribution of viral variants is crucial for therapeutic design and disease surveillance. This task poses significant modeling challenges due to the sharp differences in virus distributions across sub-populations (e.g., countries) and their dynamic interactions. Existing machine learning approaches that model the variant distribution as a whole are incapable of making location-specific predictions and ignore transmissions that shape the viral landscape. In this paper, we propose a sub-population specific protein evolution model, which predicts the time-resolved distributions of viral proteins in different locations. The algorithm explicitly models the transmission rates between sub-populations and learns their interdependence from data. The change in protein distributions across all sub-populations is defined through a linear ordinary differential equation (ODE) parametrized by transmission rates. Solving this ODE yields the likelihood of a given protein occurring in particular sub-populations. Multi-year evaluation on both SARS-CoV-2 and influenza A/H3N2 demonstrates that our model outperforms baselines in accurately predicting distributions of viral proteins across continents and countries. We also find that the transmission rates learned from data are consistent with the transmission pathways discovered by retrospective phylogenetic analysis.
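To make the linear-ODE idea concrete, here is a minimal sketch of how a system of the form dp/dt = A p can be solved in closed form by eigen-decomposition. It is an illustration under stated assumptions, not the authors' implementation: the function name solve_linear_ode, the 3-location matrix A_toy, and the initial vector p0 are all hypothetical, and in the paper A would come from a neural network conditioned on the protein sequence rather than being fixed by hand. The sketch also assumes A is diagonalizable.

```python
import numpy as np


def solve_linear_ode(A, p0, t):
    """Closed-form solution of dp/dt = A p with p(0) = p0.

    Assumes A is diagonalizable, A = V diag(lam) V^{-1}, so that
    p(t) = V diag(exp(lam * t)) V^{-1} p0.
    """
    eigvals, eigvecs = np.linalg.eig(A)
    # Coordinates of p0 in the eigenvector basis (solves V c = p0).
    coeffs = np.linalg.solve(eigvecs, p0.astype(complex))
    # Each eigen-direction evolves independently as exp(lam * t).
    p_t = eigvecs @ (np.exp(eigvals * t) * coeffs)
    # For real A and real p0 the imaginary parts cancel up to round-off.
    return p_t.real


# Hypothetical transmission-rate matrix for three locations (values invented).
# Entry A_toy[i, j] is read here as the rate at which location j feeds location i.
A_toy = np.array([
    [-0.3, 0.1, 0.0],
    [0.2, -0.1, 0.4],
    [0.1, 0.0, -0.4],
])

# Hypothetical starting state: the variant is present only in location 0.
p0 = np.array([1.0, 0.0, 0.0])

for t in (0.0, 1.0, 5.0):
    print(t, solve_linear_ode(A_toy, p0, t))
```

Because the solution is available in closed form, the state at any future time can be evaluated directly without stepping a numerical integrator, which is one reason a linear formulation is convenient for forecasting at arbitrary horizons.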
