Generalised Bayesian distance-based phylogenetics for the genomics era

Matthew J. Penn, Neil Scheidwasser, Mark P. Khurana, Christl A. Donnelly, David A. Duchêne, Samir Bhatt

TL;DR

This work addresses the computational bottlenecks of likelihood-based and Bayesian phylogenetic methods in the genome era by introducing a generalized Bayesian distance-based framework built on an entropic likelihood. The entropic approach defines an inter-taxa entropic distance $d^S_{ij}$ and formulates a likelihood $\boldsymbol{\ell}_S$ that is computationally efficient and closely related to Felsenstein's likelihood through a linear relationship that can be calibrated. The authors demonstrate that the entropic method yields Bayesian posteriors that align with distance-based bootstrap distributions on standard benchmarks, and scales to massive datasets such as a 60-million-site avian alignment, revealing substantial uncertainty in post-K-Pg diversification. They also provide analytical justifications for the near-linearity with Felsenstein's likelihood and establish a practical calibration (gradient $m$) to connect the entropic and traditional likelihoods, enabling robust, scalable phylogenetic inference in genomics-scale analyses. The practical impact lies in enabling uncertainty-aware, model-based phylogenetics on thousands of taxa and millions of sites, making genome-scale evolutionary inference feasible without sacrificing principled probabilistic interpretation.
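
The calibration idea can be illustrated with a minimal sketch. It assumes paired log-likelihood values are already available for a set of candidate trees and that the linear relationship takes the form Felsenstein ≈ m × entropic + c. The function names, the intercept `c`, and the synthetic numbers are hypothetical and are not taken from the paper; the fit is plain ordinary least squares, which may differ from the authors' calibration procedure.

```python
import random


def fit_gradient(entropic, felsenstein):
    """Ordinary least-squares fit of felsenstein ~ m * entropic + c.

    Hypothetical helper: the paper's own calibration of m may differ.
    """
    n = len(entropic)
    mean_x = sum(entropic) / n
    mean_y = sum(felsenstein) / n
    sxx = sum((x - mean_x) ** 2 for x in entropic)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(entropic, felsenstein))
    m = sxy / sxx
    c = mean_y - m * mean_x
    return m, c


def mean_abs_pct_error(entropic, felsenstein, m, c):
    """Mean absolute percentage error of the linear model, on the log-likelihood scale."""
    errors = [abs((m * x + c - y) / y) for x, y in zip(entropic, felsenstein)]
    return 100.0 * sum(errors) / len(errors)


# Synthetic stand-in data (assumption): 200 trees whose Felsenstein
# log-likelihoods are a noisy linear function of the entropic ones.
rng = random.Random(0)
true_m, true_c = 2.5, -40.0
entropic = [rng.uniform(-6000.0, -5000.0) for _ in range(200)]
felsenstein = [true_m * x + true_c + rng.gauss(0.0, 5.0) for x in entropic]

m, c = fit_gradient(entropic, felsenstein)
mape = mean_abs_pct_error(entropic, felsenstein, m, c)
print(f"estimated gradient m = {m:.3f}, MAPE = {mape:.4f}%")
```

With real data, `entropic` and `felsenstein` would hold the two log-likelihoods evaluated on the same trees, and a small percentage error would indicate that the linear calibration is adequate for that dataset.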

Abstract

As whole genomes become widely available, maximum likelihood and Bayesian phylogenetic methods are demonstrating their limits in meeting the escalating computational demands. Conversely, distance-based phylogenetic methods are efficient, but are rarely favoured due to their inferior performance. Here, we extend distance-based phylogenetics using an entropy-based likelihood of the evolution among pairs of taxa, allowing for fast Bayesian inference in genome-scale datasets. We provide evidence of a close link between the inference criteria used in distance methods and Felsenstein's likelihood, such that the methods are expected to have comparable performance in practice. Using the entropic likelihood, we perform Bayesian inference on three phylogenetic benchmark datasets and find that estimates closely correspond with previous inferences. We also apply this rapid inference approach to a 60-million-site alignment from 363 avian taxa, covering most avian families. The method has outstanding performance and reveals substantial uncertainty in the avian diversification events immediately after the K-Pg transition event. The entropic likelihood allows for efficient Bayesian phylogenetic inference, accommodating the analysis demands of the genomic era.

Paper Structure

This paper contains 43 sections, 9 theorems, 120 equations, 6 figures, 3 tables.

Key Result

theorem 1

Define Consider taking $Kt \to 0$ while keeping the stationary distribution $\pi$ and the ratios $\frac{Q_{ab}}{Q_{cd}}$ constant (i.e. one can change $t$ or scale the matrix $Q$ by a constant multiple). For some fixed $k_1$ and variable $k_2 > k_1$, define a linear BME approximation to the entropic likelihood. Then, and, moreover, the percentage error is small

Figures (6)

  • Figure 1: Genetic distances $d_{ij}$ against corresponding entropic distances $\mathbb{E}(D^S_{ij}(d_{ij}))$ for three empirical datasets (see Table \ref{tab:data}; DS1 is blue, DS2 is red and DS3 is purple). For all datasets we see a strong linear relationship and correlation nearly equal to 1. Note that non-linearity tends to exist close to zero, where there is less data.
  • Figure 2: Performance of the entropic likelihood across evolutionary rates. Left: The variation of the scaling between the entropic and Felsenstein's likelihoods as a function of rate, where red indicates the entropic distance estimated using Equation \ref{eq:entropic_dist} assuming exponentially distributed branch lengths. Blue points use the same procedure, but the branching rate $\theta$ is re-estimated for each tree and a new entropic distance found. Middle: Comparison of topological accuracy for Felsenstein's likelihood in black, entropic distance in red, and entropic distances when re-estimating $\theta$ in blue. Right: The mean absolute percentage error between Felsenstein's likelihood and a linear model of entropic distance. The percentage error is on a log-likelihood scale. Black dotted lines show where most empirical data exist (Klopfstein2017-uv).
  • Figure 3: Comparison of entropic and Felsenstein's likelihoods assuming exponential branch lengths. (a) Example single true tree from simulation. (b) A comparison of the likelihoods of suboptimal trees inferred from data simulated through the true tree. Black dots are suboptimal trees generated by performing random SPR moves from the best estimated tree, and red dots are entirely random trees. (c) A comparison of the likelihoods for the best tree across 2000 simulated alignments.
  • Figure 4: Kullback-Leibler divergence across weight values for our likelihood, for 1,000 alignments of 50 taxa and 5,000 sites, generated from a birth-death process. The Kullback-Leibler divergence is generated from a random sample of trees perturbed from the optimal tree via SPR moves. (a) shows the standard entropic likelihood, which suffers from misspecification due to a variety of approximation errors; (b) shows the calibrated entropic likelihood, which does not suffer from substantial misspecification.
  • Figure 5: For the set of unique trees, the generalised Robinson-Foulds distance is calculated, and the distance matrix reduced by multidimensional scaling. A Jukes-Cantor model was used. ML Bootstrap: RAxML-NG (kozlov2019) bootstrap; BME bootstrap: FastME (lefort2015) bootstrap; BME MCMC is the method presented in this paper; BME and MLE are the point estimates. To facilitate visualisation, only a random sample of 5000 trees is shown.
  • ...and 1 more figure

Theorems & Definitions (9)

  • theorem 1
  • lemma 1
  • lemma 2
  • lemma 3
  • lemma 4
  • lemma 5
  • lemma 6
  • lemma 7
  • lemma 8