Nonparametric Modeling of Continuous-Time Markov Chains
Filippo Monti, Xiang Ji, Marc A. Suchard
TL;DR
This work addresses the challenge of inferring the infinitesimal rates of continuous-time Markov chains (CTMCs) in large state spaces with incomplete data by connecting rate parameters to covariates through a flexible Gaussian process (GP) prior. It replaces restrictive linear covariate models with nonparametric GP priors and implements an efficient Hamiltonian Monte Carlo (HMC) scheme augmented by a scalable gradient approximation that reduces the cost of computing the likelihood gradient from $O(K^5)$ to $O(K^2)$, where $K$ is the number of states. The approach is extended to phylogeographic settings, deriving efficient tree-structured likelihoods and demonstrating strong performance on synthetic data and real viral datasets such as bat rabies and global influenza. The combination of GP-based covariate modeling and scalable inference expands the applicability of CTMCs to complex, covariate-rich domains and provides practical tools for researchers in phylogenetics and epidemiology.
Abstract
Inferring the infinitesimal rates of continuous-time Markov chains (CTMCs) is a central challenge in many scientific domains. This task is hindered by three factors: quadratic growth in the number of rates as the CTMC state space expands, strong dependencies among rates, and incomplete information for many transitions. We introduce a new Bayesian framework that flexibly models the CTMC rates by incorporating covariates through Gaussian processes (GPs). This approach improves inference by integrating new information and contributes to the understanding of the CTMC's stochastic behavior by shedding light on potential external drivers. Unlike previous approaches limited to linear covariate effects, our method captures complex non-linear relationships, enabling fuller use of covariate information and more accurate characterization of their influence. To perform efficient inference, we employ a scalable Hamiltonian Monte Carlo (HMC) sampler. We address the prohibitive cost of computing the exact likelihood gradient by integrating the HMC trajectories with a scalable gradient approximation, reducing the computational complexity from $O(K^5)$ to $O(K^2)$, where $K$ is the number of CTMC states. Finally, we demonstrate our method on Bayesian phylogeographic inference -- a domain where CTMCs are central -- showing effectiveness on both synthetic and real datasets.
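To make the core idea concrete, the following is a minimal sketch of how a GP over pairwise covariates could induce a CTMC rate matrix with $K(K-1)$ off-diagonal rates. It is not the authors' implementation. The squared-exponential kernel, the log link between the GP value and the rate, the toy covariates (a distance and an origin "size"), and all function names are assumptions made purely for illustration.

```python
import math
import random


def sq_exp_kernel(x, y, variance=1.0, length_scale=1.0):
    """Squared-exponential covariance between two covariate vectors (assumed kernel)."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return variance * math.exp(-0.5 * d2 / length_scale ** 2)


def cholesky(a):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    return low


def sample_rate_matrix(covariates, rng, jitter=1e-6):
    """Draw one CTMC generator from a GP prior on the log rates.

    covariates maps each ordered state pair (i, j), i != j, to a covariate vector.
    """
    pairs = sorted(covariates)
    n = len(pairs)
    # GP covariance across all ordered pairs; jitter keeps the factorization stable.
    cov = [[sq_exp_kernel(covariates[p], covariates[q]) + (jitter if p == q else 0.0)
            for q in pairs] for p in pairs]
    low = cholesky(cov)
    z = [rng.gauss(0.0, 1.0) for _ in range(n)]
    # f = L z is a zero-mean GP draw evaluated at the pairwise covariates.
    f = [sum(low[i][k] * z[k] for k in range(i + 1)) for i in range(n)]
    num_states = 1 + max(max(i, j) for i, j in pairs)
    q_mat = [[0.0] * num_states for _ in range(num_states)]
    for (i, j), f_ij in zip(pairs, f):
        q_mat[i][j] = math.exp(f_ij)  # assumed log link: rates stay positive
    for i in range(num_states):
        q_mat[i][i] = -sum(q_mat[i])  # each generator row sums to zero
    return q_mat


# Toy example with K = 3 states (hypothetical locations and sizes).
locations = [0.0, 1.0, 3.0]
sizes = [0.5, 1.5, 2.5]
covariates = {(i, j): [abs(locations[i] - locations[j]), sizes[i]]
              for i in range(3) for j in range(3) if i != j}
Q = sample_rate_matrix(covariates, random.Random(0))
```

Because rates for pairs with similar covariates are correlated under the kernel, sparsely observed transitions can borrow information from better-observed ones. In a full analysis the GP values would be sampled jointly with the other model parameters (for example with HMC) rather than drawn once from the prior as done here.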
