Predicting sub-population specific viral evolution

Wenxian Shi, Menghua Wu, Regina Barzilay

TL;DR

The paper introduces a sub-population–aware framework for predicting time-varying distributions of viral protein sequences by learning transmission rates between locations via a neural-network–parameterized matrix A(x;theta). The dynamics are governed by a linear ODE, with closed-form solutions obtained through eigen-decomposition, and probabilities computed via autoregressive conditioning on amino-acid prefixes. A hierarchical variant reduces computational complexity by block-diagonalizing the transmission matrix and employing group-based approximations (G2L, G2G), while transfer-like sharing across locations supports data-sparse sub-populations. Empirical results on SARS-CoV-2 and influenza A/H3N2 show superior predictive performance and alignment between learned transmission pathways and phylogenetic analyses, with high coverage of future circulating sequences. The work highlights practical implications for location-specific vaccine design and surveillance, while noting limitations including the need for sub-population annotations and the assumption of time-stable, linear dynamics.
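The two ingredients named above, the eigen-decomposition solution and the autoregressive factorization, can be written out as follows. This is a sketch in notation assumed here for illustration; only $A(x;\theta)$ appears in the summary, and the symbols $\mathbf{p}$, $V$, $\Lambda$, $x_i$ and $L$ are placeholders that may not match the paper's own equations. For a rate matrix that is constant in time and diagonalizable, a linear ODE has the closed-form solution

$$
\frac{d\,\mathbf{p}(t)}{dt} = A\,\mathbf{p}(t), \qquad A = V \Lambda V^{-1} \;\;\Longrightarrow\;\; \mathbf{p}(t) = V\, e^{\Lambda t}\, V^{-1}\, \mathbf{p}(0),
$$

where $e^{\Lambda t}$ is diagonal with entries $e^{\lambda_k t}$. Autoregressive conditioning on amino-acid prefixes corresponds to the chain-rule factorization of a length-$L$ sequence $x = (x_1, \dots, x_L)$:

$$
p(x) = \prod_{i=1}^{L} p(x_i \mid x_{<i}).
$$

How the paper couples the two (for example, which quantities the ODE state holds for each prefix) is not stated in this summary and is left unspecified here.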

Abstract

Forecasting the change in the distribution of viral variants is crucial for therapeutic design and disease surveillance. This task poses significant modeling challenges due to the sharp differences in virus distributions across sub-populations (e.g., countries) and their dynamic interactions. Existing machine learning approaches that model the variant distribution as a whole are incapable of making location-specific predictions and ignore transmissions that shape the viral landscape. In this paper, we propose a sub-population specific protein evolution model, which predicts the time-resolved distributions of viral proteins in different locations. The algorithm explicitly models the transmission rates between sub-populations and learns their interdependence from data. The change in protein distributions across all sub-populations is defined through a linear ordinary differential equation (ODE) parametrized by transmission rates. Solving this ODE yields the likelihood of a given protein occurring in particular sub-populations. Multi-year evaluation on both SARS-CoV-2 and influenza A/H3N2 demonstrates that our model outperforms baselines in accurately predicting distributions of viral proteins across continents and countries. We also find that the transmission rates learned from data are consistent with the transmission pathways discovered by retrospective phylogenetic analysis.
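To make the "linear ODE parametrized by transmission rates" concrete, the following is a minimal numerical sketch, not the authors' implementation. It assumes a fixed, diagonalizable rate matrix and solves $dp/dt = A\,p$ by eigen-decomposition. The function name `solve_linear_ode`, the toy 3-location matrix `A`, and the initial vector `p0` are all hypothetical; in the paper the matrix is produced by a neural network, which is not modeled here.

```python
import numpy as np


def solve_linear_ode(A, p0, t):
    """Return p(t) for dp/dt = A p with p(0) = p0.

    Assumes A is constant in time and diagonalizable (illustrative assumption).
    """
    # Eigen-decomposition A = V diag(eigvals) V^{-1}
    eigvals, V = np.linalg.eig(A)
    # Coordinates of p0 in the eigenbasis: solve V c = p0
    coeffs = np.linalg.solve(V, p0.astype(complex))
    # Each eigen-component evolves independently as exp(lambda_k * t)
    p_t = V @ (np.exp(eigvals * t) * coeffs)
    # For a real A and real p0 the imaginary parts cancel up to round-off
    return p_t.real


# Hypothetical 3-location rate matrix; each column sums to zero, so the
# total mass across locations is conserved in this toy example.
A = np.array([
    [-0.3,  0.1,  0.0],
    [ 0.2, -0.4,  0.1],
    [ 0.1,  0.3, -0.1],
])
p0 = np.array([1.0, 0.0, 0.0])  # all mass starts in location 0

p_t = solve_linear_ode(A, p0, t=2.0)
print(p_t)
```

The eigenbasis form is convenient when the solution is needed at many time points, because only the diagonal factor `np.exp(eigvals * t)` depends on `t`.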

Paper Structure

This paper contains 20 sections, 22 equations, 16 figures, 11 tables.

Figures (16)

  • Figure 1: SARS-CoV-2 clade distributions differ by geographical location and change interdependently over time.
  • Figure 2: The transmission model: an auto-regressive, time-resolved generative model, parametrized in terms of transmission rate matrices.
  • Figure 3: Hierarchical transmission model. (a) Instead of modeling transmission between all pairs of countries, we model interactions within and across groups of locations (e.g. continents). (b) This idea can be realized by a block-wise diagonal $\tilde{A}$. (c) Two strategies for learning transmissions within each group.
  • Figure 4: Average negative log-likelihood (NLL) and reverse negative log-likelihood for Flu and Cov. Lower is better. Error bands represent the 95% confidence interval across different oracle models.
  • Figure 5: (a) Average transmission rate matrix among sequences collected during the 2018 winter flu season in clade 3C.2a1b.2b. (b) The maximal spanning tree obtained from the rate matrices. (c) The phylogenetic tree obtained from Nextstrain. Africa (AF) is not included due to insufficient data.
  • ...and 11 more figures

Theorems & Definitions (1)

  • proof