On the lumpability of tree-valued Markov chains

Rodrigo B. Alves, Yuri F. Saporito, Luiz M. Carvalho

Abstract

Phylogenetic trees constitute an interesting class of objects for stochastic processes due to the non-standard nature of the space they inhabit. In particular, many statistical applications require the construction of Markov processes on the space of trees, whose cardinality grows superexponentially with the number of leaves considered. We investigate whether certain lower-dimensional projections of tree space preserve the Markov property in tree-valued Markov processes. We study exact lumpability of tree shapes and $\varepsilon$-lumpability of clades, exploiting the combinatorial structure of the SPR graph to obtain bounds on the lumping error under the random walk and Metropolis-Hastings processes. Finally, we show how to use these results in empirical investigation, leveraging exact and $\varepsilon$-lumpability to improve Monte Carlo estimation of tree-related quantities.

Paper Structure

This paper contains 25 sections, 20 theorems, 72 equations, 16 figures, 1 table.

Key Result

Proposition 1

Let $(X_k)_{k \ge 0}$ be an aperiodic and irreducible Markov chain on a finite state space $\mathcal{S}$ with stationary distribution $\pi_X$ and transition matrix $P$. Suppose that $(X_k)_{k \ge 0}$ is lumpable with respect to a partition $\bar{S}:=\{E_1, E_2, \dots, E_v\}$. The projected process $(Y
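As an illustration of the lumpability hypothesis in Proposition 1, the sketch below checks the standard Kemeny-Snell (strong) lumpability criterion numerically: a chain is lumpable with respect to a partition when, for every pair of blocks $E_i, E_j$, the aggregated probability $\sum_{y \in E_j} P(x, y)$ is the same for all $x \in E_i$. We assume this is the notion used in the paper's Definition 2, which is not reproduced here. The function names `is_lumpable` and `lumped_matrix` and the 3-state matrix are hypothetical and chosen only for the example; they are not taken from the paper.

```python
import numpy as np


def is_lumpable(P, partition, tol=1e-12):
    """Check the Kemeny-Snell criterion for a transition matrix P.

    `partition` is a list of blocks, each block a list of state indices.
    Assumption: this is the usual strong-lumpability condition.
    """
    P = np.asarray(P, dtype=float)
    for block in partition:
        # One row per state in `block`, one column per target block:
        # entry (x, j) is the probability of moving from x into block j.
        agg = np.column_stack(
            [P[np.ix_(block, target)].sum(axis=1) for target in partition]
        )
        # Every state in the block must give the same aggregated row.
        if not np.allclose(agg, agg[0], atol=tol):
            return False
    return True


def lumped_matrix(P, partition):
    """Transition matrix of the projected chain (valid only if lumpable)."""
    P = np.asarray(P, dtype=float)
    # Any representative of a block works; we use its first state.
    return np.array(
        [[P[block[0], target].sum() for target in partition] for block in partition]
    )


# Hypothetical 3-state chain: states 1 and 2 behave identically in aggregate.
P = np.array(
    [
        [0.5, 0.25, 0.25],
        [0.3, 0.2, 0.5],
        [0.3, 0.5, 0.2],
    ]
)
good_partition = [[0], [1, 2]]
bad_partition = [[0, 1], [2]]

print(is_lumpable(P, good_partition))
print(is_lumpable(P, bad_partition))
print(lumped_matrix(P, good_partition))
```

In this toy chain the partition `[[0], [1, 2]]` satisfies the criterion while `[[0, 1], [2]]` does not, because states 0 and 1 enter state 2 with different probabilities. For tree-valued chains the blocks would be sets of trees (for example, trees sharing a shape), and explicit enumeration like this is feasible only for very small numbers of leaves.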

Figures (16)

  • Figure 1: Autocorrelation spectra of clade indicators for the lazy Metropolis-Hastings chain. We show the empirical autocorrelation spectra up to lag $k=50$ (black bars) for indicators of clades {t1, t2} and {t1, t2, t3}, computed from a single realisation of a lazy Metropolis-Hastings chain with $\rho = 0.9$. The autocorrelation function of the best-fitting two-state Markov chain is also shown (red line).
  • Figure 2: A tree $x \in \boldsymbol{T}_6$ and one of its clades. The clade $c$ as a subtree with leaves $\{1,2,5\}$ is shown in the dashed rectangle.
  • Figure 3: Labelled and unlabelled rooted trees. On the right, we depict a rooted, labelled binary phylogenetic tree with 6 leaves. On the left, we present the same tree, but without labels. From the unlabelled tree on the left, we can generate 30 distinct trees in $\boldsymbol{T}_5$.
  • Figure 4: The three possible rooted subtree prune-and-regraft (rSPR) operations. As described in the text, there are three main ways an rSPR operation can be performed, depending on how it interacts with the root. Open circles denote leaves and closed circles denote internal vertices.
  • Figure 5: A balanced tree (left) and a ladder tree (right). These represent the trees with the largest and smallest neighbourhoods in the rSPR graph, respectively.
  • ...and 11 more figures

Theorems & Definitions (42)

  • Definition 1: Clade partition of tree space
  • Definition 2: Lumpability
  • Proposition 1
  • proof
  • Definition 3: Lumping error and $\varepsilon$-lumpability
  • Proposition 2: SPR neighbourhood (Song, 2003)
  • Corollary 1: SPR neighbourhood maximum and minimum (Song, 2003)
  • Lemma 1: The neighbourhood of a tree containing a clade
  • Lemma 2: Counting the neighbours that do not share a clade $c$ with $x \in S_1(c)$.
  • proof
  • ...and 32 more