Large-sample analysis of cost functionals for inference under the coalescent

Martina Favero, Jere Koskela

TL;DR

The results constitute the first theoretical description of large-sample importance sampling algorithms for the coalescent, provide heuristics for the a priori optimisation of computational effort, and identify settings where resampling is harmful for algorithm performance.

Abstract

The coalescent is a foundational model of latent genealogical trees under neutral evolution, but suffers from intractable sampling probabilities. Methods for approximating these sampling probabilities either introduce bias or fail to scale to large sample sizes. We show that a class of cost functionals of the coalescent with recurrent mutation and a finite number of alleles converge to tractable processes in the infinite-sample limit. A particular choice of costs yields insight about importance sampling methods, which are a classical tool for coalescent sampling probability approximation. These insights reveal that the behaviour of coalescent importance sampling algorithms differs markedly from standard sequential importance samplers, with or without resampling. We conduct a simulation study to verify that our asymptotics are accurate for algorithms with finite (and moderate) sample sizes. Our results constitute the first theoretical description of large-sample importance sampling algorithms for the coalescent, provide heuristics for the a priori optimisation of computational effort, and identify settings where resampling is harmful for algorithm performance. We observe strikingly different behaviour for importance sampling methods under the infinite sites model of mutation, which is regarded as a good and more tractable approximation of finite alleles mutation in most respects.

Paper Structure

This paper contains 25 sections, 5 theorems, 76 equations, 7 figures.

Key Result

Theorem 3.3

Let $\mathbf{Z}^{(n)}=(C^{(n)}, \mathbf{Y}^{(n)},\mathbf{M}^{(n)}) \subset \mathbb{R}_+ \times \frac{1}{n}\mathbb{N}^d\setminus \{\boldsymbol{0}\} \times \mathbb{N}^{d^2}, n\in\mathbb{N},$ be the sequence composed of the cost sequence $C^{(n)}$ of Definition 2.6, the scaled block-counting sequence ... the mutation-counting process $\mathbf{M}=(M_{ij})_{i,j=1}^d$ is the matrix-valued process with ...

Figures (7)

  • Figure 1: Logarithms of normalised second moments of importance weights under the GT and SD proposals, measured by stopping replicates upon first hitting each fixed number of remaining lineages. Each figure is an average over 10 000 replicates.
  • Figure 2: Number of simulated coalescence steps from the one-step proposal distribution $q_{SD}( \cdot | \cdot )$ for the four schedules with $\Gamma = 10^4$, $\gamma = 100$, $\theta = 0.5$, and $\chi = 0.1$. Note the log-scale on both axes.
  • Figure 3: Performance of the four schedules with $\gamma = 10^2$ and $\Gamma = 10^4$ for various sample sizes, based on independent simulations at points $\theta \in \{0.1, 0.2, \ldots, 0.9\}$. Standard errors were computed using the method of chan:2013 for schedule 2, where replicates are not independent. The data-generating parameter is $\theta = 0.5$. Each y-axis is multiplied by an appropriate large constant to aid visualisation, and small horizontal offsets have been artificially added to all four curves in each panel for visual clarity.
  • Figure 4: A repeat of the simulation in Figure \ref{fig:sd-surfaces:a} in which replicates were stopped whenever the number of lineages decreased. Once all replicates had stopped, systematic resampling chopin:2020 was performed if the effective sample size, ESS in \ref{ess}, was less than 10% of the number of replicates. The y-axis is expressed in units of $10^{-13}$ to aid visualisation, and small horizontal offsets have been artificially added to all four curves for visual clarity.
  • Figure 5: Logarithms of normalised second moments of importance weights for the GT, SD, and HUW proposals, measured by stopping replicates upon first hitting each fixed number of remaining lineages. Each figure was obtained by averaging $10^5$ replicates. The results from Figure \ref{fig:cost:a} and \ref{fig:cost:b} are reproduced in dashed lines in panels \ref{fig:cost-ism:a} and \ref{fig:cost-ism:b} for ease of comparison.
  • ...and 2 more figures
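
The caption of Figure 4 describes resampling that is triggered only when the effective sample size drops below 10% of the number of replicates. A minimal sketch of that rule follows. It assumes the common definition ESS $= (\sum_i w_i)^2 / \sum_i w_i^2$, which may differ from the paper's own ESS formula, and the function names `effective_sample_size`, `systematic_resample`, and `maybe_resample` are hypothetical, introduced here only for illustration. It is not the authors' implementation.

```python
import bisect
import random
from itertools import accumulate


def effective_sample_size(weights):
    """ESS under the assumed definition (sum w)^2 / sum w^2."""
    total = sum(weights)
    return total * total / sum(w * w for w in weights)


def systematic_resample(weights, rng):
    """Return n ancestor indices using a single uniform draw.

    The n sampling positions (u + k) / n are evenly spaced, so each
    particle is copied roughly n * (normalised weight) times.
    """
    n = len(weights)
    total = sum(weights)
    cumulative = list(accumulate(w / total for w in weights))
    cumulative[-1] = 1.0  # guard against floating-point round-off
    u = rng.random()
    return [
        min(bisect.bisect_right(cumulative, (u + k) / n), n - 1)
        for k in range(n)
    ]


def maybe_resample(particles, weights, rng, threshold=0.1):
    """Resample only if ESS < threshold * n (10% by default, as in the caption)."""
    n = len(weights)
    if effective_sample_size(weights) >= threshold * n:
        return list(particles), list(weights)
    indices = systematic_resample(weights, rng)
    # Equal post-resampling weights that preserve the total weight,
    # so the average weight (the likelihood estimate) is unchanged.
    mean_weight = sum(weights) / n
    return [particles[i] for i in indices], [mean_weight] * n


if __name__ == "__main__":
    rng = random.Random(1)
    particles = [f"p{i}" for i in range(20)]
    weights = [1000.0] + [1.0] * 19  # one dominant replicate
    print(effective_sample_size(weights))
    new_particles, new_weights = maybe_resample(particles, weights, rng)
    print(new_particles.count("p0"), sum(new_weights))
```

Setting every post-resampling weight to the pre-resampling mean keeps the total weight fixed, which is one standard way to leave the importance sampling estimate unbiased; other bookkeeping conventions are possible.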

Theorems & Definitions (18)

  • Definition 2.1: Forward transition probabilities
  • Definition 2.2: Backward transition probabilities
  • Remark 2.3: Parent-independent Mutations (PIM)
  • Definition 2.4: Scaled block-counting sequence
  • Definition 2.5: Mutation-counting sequence
  • Definition 2.6: Cost-counting sequence
  • Theorem 3.3: Convergence of general costs
  • proof
  • Proposition 4.1: Asymptotic cost of one GT step
  • proof
  • ...and 8 more