Nonparametric Modeling of Continuous-Time Markov Chains

Filippo Monti, Xiang Ji, Marc A. Suchard

TL;DR

This work addresses the challenge of inferring infinitesimal CTMC rates in large state spaces with incomplete data by connecting rate parameters to covariates through a flexible Gaussian process prior. It replaces restrictive linear covariate models with nonparametric GP priors and implements an efficient Hamiltonian Monte Carlo scheme augmented by a scalable gradient approximation that reduces the cost of each gradient evaluation from $O(K^5)$ to $O(K^2)$, where $K$ is the number of states. The approach is extended to phylogeographic settings, deriving efficient tree-structured likelihoods and demonstrating strong performance on synthetic data and real viral datasets such as bat rabies and global influenza. The combination of GP-based covariate modeling and scalable inference expands the applicability of CTMCs to complex, covariate-rich domains and provides practical tools for researchers in phylogenetics and epidemiology.

Abstract

Inferring the infinitesimal rates of continuous-time Markov chains (CTMCs) is a central challenge in many scientific domains. This task is hindered by three factors: quadratic growth in the number of rates as the CTMC state space expands, strong dependencies among rates, and incomplete information for many transitions. We introduce a new Bayesian framework that flexibly models the CTMC rates by incorporating covariates through Gaussian processes (GPs). This approach improves inference by integrating new information and contributes to the understanding of the CTMC stochastic behavior by shedding light on potential external drivers. Unlike previous approaches limited to linear covariate effects, our method captures complex non-linear relationships, enabling fuller use of covariate information and more accurate characterization of their influence. To perform efficient inference, we employ a scalable Hamiltonian Monte Carlo (HMC) sampler. We address the prohibitive cost of computing the exact likelihood gradient by integrating the HMC trajectories with a scalable gradient approximation, reducing the computational complexity from $O(K^5)$ to $O(K^2)$, where $K$ is the number of CTMC states. Finally, we demonstrate our method on Bayesian phylogeography inference -- a domain where CTMCs are central -- showing effectiveness on both synthetic and real datasets.
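
The modeling idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes one scalar covariate per ordered pair of states, a squared-exponential kernel, and arbitrary values for the lengthscale, jitter, and elapsed time; all function names are hypothetical. The sketch draws the $K(K-1)$ log-rates from a GP over the covariate, assembles the rate matrix, and computes finite-time transition probabilities with a simple scaling-and-squaring Taylor series.

```python
import numpy as np

def squared_exponential(x, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel matrix for a 1-D covariate vector x (assumed kernel)."""
    diff = x[:, None] - x[None, :]
    return variance * np.exp(-0.5 * (diff / lengthscale) ** 2)

def rate_matrix_from_log_rates(log_rates, K):
    """Place exp(log_rates) on the K*(K-1) off-diagonal entries (row-major order)
    and set each diagonal entry so that the row sums to zero."""
    Q = np.zeros((K, K))
    off_diag = ~np.eye(K, dtype=bool)
    Q[off_diag] = np.exp(log_rates)
    np.fill_diagonal(Q, -Q.sum(axis=1))
    return Q

def transition_probabilities(Q, t, terms=20):
    """Approximate exp(Q * t) by scaling and squaring with a truncated Taylor series."""
    A = Q * t
    norm = np.linalg.norm(A, 1)
    # Halve A enough times that its 1-norm is at most about 0.5.
    s = max(0, int(np.ceil(np.log2(norm))) + 1) if norm > 0 else 0
    A = A / (2 ** s)
    P = np.eye(A.shape[0])
    term = np.eye(A.shape[0])
    for n in range(1, terms + 1):
        term = term @ A / n
        P = P + term
    for _ in range(s):
        P = P @ P
    return P

rng = np.random.default_rng(0)
K = 4                                            # number of CTMC states
n_rates = K * (K - 1)                            # rates grow quadratically in K
x = rng.uniform(0.0, 1.0, size=n_rates)          # hypothetical pairwise covariate

# GP prior draw: log-rates are jointly Gaussian, correlated through the covariate.
cov = squared_exponential(x, lengthscale=0.3) + 1e-6 * np.eye(n_rates)
log_rates = np.linalg.cholesky(cov) @ rng.standard_normal(n_rates)

Q = rate_matrix_from_log_rates(log_rates, K)
P = transition_probabilities(Q, t=0.5)
print(Q.round(3))
print(P.round(3))
```

Transitions with similar covariate values receive similar log-rates under this prior, which is how covariates can inform rates for transitions that are rarely or never observed. A log-linear model would instead constrain the log-rates to a straight line in the covariate.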

Paper Structure

This paper contains 49 sections, 32 equations, 4 figures.

Figures (4)

  • Figure 1: Simulation studies. Panel (a) compares log-linear (LL) and Gaussian process (GP) models in recovering CTMC log-rates from simulated data where true log-rates (dashed line) follow a quadratic function of host genetic distance. Solid lines show posterior median normalized log-rates; shaded regions represent 95% highest posterior density intervals (HPDIs). Only slope uncertainty is depicted due to normalization. Panel (b) compares computational efficiency, showing average time per gradient evaluation versus the number of CTMC states $K$. The approximate gradient (red) scales as $O(K^2)$ while the exact numerical gradient (blue) scales as $O(K^5)$, with scaling trends anchored at 128 states.
  • Figure 2: Data examples: log-rates vs predictors. Panel (a) shows the effect of cross-species genetic distances on rabies virus transmission rates; panel (b) shows the effect of origin country population density on global flu transmission. Solid lines represent posterior median inferred log-rates under log-linear (LL) and Gaussian process-based (GP) models; shaded areas show 95% HPDIs. Only slope uncertainty is depicted due to normalization.
  • Figure 3: Bat Rabies data example. The viral tree (left) shows the rabies virus evolution in North American bats. Yellow bars represent 95% HPDIs for node ages; branch colors identify the most probable host bat species. The bat species tree (right) orders bat species by genetic distance using Ward's hierarchical clustering algorithm (Ward Jr., 1963). Color symmetry between trees suggests rabies transmission occurs preferentially among genetically similar bat species.
  • Figure 4: Global Flu data example. Maximum clade credibility (MCC) tree showing H3N2 influenza evolutionary history. Yellow bars on internal nodes represent 95% HPDIs for node ages. Branch colors indicate distinct air communities (map, top-right; dots represent analyzed airports), with each color showing the air community where the GP model predicts most time is spent. The bottom plot shows trunk rewards (cumulative time in years the CTMC spends in each state along the trunk, defined as external branches facing top-right) based on the GP model, with error bars representing 95% HPDIs.