On estimating the effective sample size of phylogenetic trees in an autocorrelated chain

Jonathan Klawitter, Lars Berling, Jordan Douglas, Dong Xie, Alexei J. Drummond

TL;DR

This work compares existing tree ESS estimators with novel approaches that leverage tractable tree distributions, specifically Conditional Clade Distributions (CCDs), as well as a new probabilistic estimator based on clade frequency differences between independent chains, and examines how multimodality in posterior distributions and poor mixing can substantially affect ESS estimates.

Abstract

Estimating the effective sample size (ESS) is fundamental in Bayesian phylogenetic inference to properly account for autocorrelation in MCMC samples. While methods for continuous parameters are well established, the discrete and high-dimensional nature of treespace poses substantial challenges. Here, we compare existing tree ESS estimators with novel approaches that leverage tractable tree distributions, specifically Conditional Clade Distributions (CCDs), as well as a new probabilistic estimator based on clade frequency differences between independent chains. Using simulated chains with known ESS bounds, we assess estimator accuracy and evaluate their stability and robustness on simulated and real datasets. We further examine how multimodality in posterior distributions and poor mixing can substantially affect ESS estimates, highlighting the need for careful interpretation. Our CCD-based estimators perform comparably to existing approaches, with two methods showing lower variance by averaging across multiple estimates. In contrast, the probabilistic estimator and two previously recommended methods incur prohibitive computational costs for long chains. Together, these results provide guidance for reliable and efficient tree ESS estimation in complex phylogenetic analyses.
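For orientation, the well-established continuous-parameter case mentioned above can be sketched in a few lines. This is a minimal illustration, not one of the tree ESS estimators compared in the paper: it assumes the common definition ESS = N / ACT, where the autocorrelation time (ACT) is estimated as one plus twice the sum of the sample autocorrelations, truncated at the first non-positive lag. The function names and the AR(1) test chain are hypothetical and chosen only for this example.

```python
import numpy as np


def autocorrelation_time(trace):
    """Estimate the integrated autocorrelation time of a scalar trace.

    Assumption for this sketch: the sum of autocorrelations is truncated
    at the first non-positive lag, a simple and common heuristic.
    """
    x = np.asarray(trace, dtype=float)
    n = x.size
    x = x - x.mean()
    var = np.dot(x, x) / n
    if var == 0.0:
        return 1.0  # constant trace: no autocorrelation can be estimated
    tau = 1.0
    for lag in range(1, n):
        rho = np.dot(x[:-lag], x[lag:]) / (n * var)
        if rho <= 0.0:
            break
        tau += 2.0 * rho
    return tau


def effective_sample_size(trace):
    """ESS = N / ACT for a scalar (continuous) trace."""
    return len(trace) / autocorrelation_time(trace)


def simulate_ar1(n, phi, seed=0):
    """Hypothetical test chain: AR(1) process with coefficient phi."""
    rng = np.random.default_rng(seed)
    x = np.empty(n)
    x[0] = rng.normal()
    for i in range(1, n):
        x[i] = phi * x[i - 1] + rng.normal()
    return x


if __name__ == "__main__":
    chain = simulate_ar1(20_000, phi=0.5)
    print("estimated ACT:", autocorrelation_time(chain))
    print("estimated ESS:", effective_sample_size(chain))
```

For an AR(1) chain the theoretical ACT is (1 + phi) / (1 - phi), so the estimate for phi = 0.5 should land near 3. Trees have no such scalar trace, which is why tree ESS estimators must first map trees to numbers (for example via distances or clade frequencies) or work with tree distributions directly.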

Paper Structure

This paper contains 35 sections, 10 equations, 78 figures, and 5 tables.

Figures (78)

  • Figure 1: Each panel shows the 95% credible intervals in clade support probabilities under the EDCF strategy, given a fixed ESS. Larger ESSes give smaller sampling errors across the $m=2$ independent chains. This sampling error can be used to estimate the ESS using both maximum likelihood and Bayesian inference approaches.
  • Figure 2: Scaled autocorrelation signatures for DS3 of the logP1 and expRF1 traces for the three chain simulation methods with a target ACT$=25$, and the posterior and $\kappa$ (transition/transversion rate ratio) traces for the corresponding MCMC chain.
  • Figure 3: Estimated tree ESS for DS3 and DS4 on simulated RNNI chains with an underlying ACT of 5 across different chain lengths and different ESS estimators. Estimator accuracy is measured as the deviation from the diagonal (lower true ESS bound); summary results of this accuracy evaluation are shown in \ref{fig:accuracy:boxes:ds}.
  • Figure 4: Relative estimated tree ESS for DS3 and DS4, showing the same data as \ref{fig:acc_results_xy} but normalised by the expected lower bound. The EDCF estimator is omitted due to its comparatively high values.
  • Figure 5: Summary of estimator accuracy on simulated RNNI chains for all eleven datasets. Performance is quantified using the relative mean error, taking into account the known ESS bound. Results are reported up to an ACT of 25 due to computational limitations.
  • ...and 73 more figures