Leaping through tree space: continuous phylogenetic inference for rooted and unrooted trees

Matthew J Penn, Neil Scheidwasser, Joseph Penn, Christl A Donnelly, David A Duchêne, Samir Bhatt

TL;DR

By reframing phylogenetic tree inference as a differentiable, continuous optimization over ordered-tree distributions, GradME leverages Phylo2Vec to enable large topological jumps while optimizing a continuous balanced minimum evolution objective $F(W)$ with gradient-based methods. The framework supports both rooted and unrooted trees, derives rooting heuristics under ultrametric conditions, and introduces Queue Shuffle for principled exploration of leaf orderings; it outperforms unrooted FastME on benchmarks and can accurately root ultrametric trees using surprisingly small, clock-like data. The approach is complemented by a discrete hill-climbing alternative and open-source implementations, highlighting a new direction for efficient, differentiable phylogenetic inference with potential integration into Bayesian paradigms. Overall, GradME broadens the toolkit for challenging data-deficient phylogenetic questions by enabling large-scale optimization over tree space and providing practical rooting capabilities for clock-like datasets.

Abstract

Phylogenetics is now fundamental in life sciences, providing insights into the earliest branches of life and the origins and spread of epidemics. However, finding suitable phylogenies from the vast space of possible trees remains challenging. To address this problem, for the first time, we perform both tree exploration and inference in a continuous space where the computation of gradients is possible. This continuous relaxation allows for major leaps across tree space in both rooted and unrooted trees, and is less susceptible to convergence to local minima. Our approach outperforms the current best methods for inference on unrooted trees and, in simulation, accurately infers the tree and root in ultrametric cases. The approach is effective on empirical datasets containing negligible amounts of data, which we demonstrate on the phylogeny of jawed vertebrates. Indeed, only a few genes with an ultrametric signal were generally sufficient for resolving the major lineages of vertebrates. Optimisation is possible via automatic differentiation and our method presents an effective way forward for exploring the most difficult, data-deficient phylogenetic questions.

Paper Structure

This paper contains 11 sections, 14 theorems, 73 equations, 6 figures, 2 tables, 2 algorithms.

Key Result

Lemma 1

Consider adding an extra taxon, $n$, to the set of taxa such that, for some $D^*$ and $\delta$. This creates an unrooted tree $\mathcal{T}^u$, and we can create a rooted tree $\mathcal{T}^r$ by removing node $n$. If $e^u_{ij}$ denotes inter-taxa distance in $\mathcal{T}^u$ and $e_{ij}^r$ denotes inter-taxa distance in $\mathcal{T}^r$, then

Figures (6)

  • Figure 1: Results on empirical data. (a) Starting from a random tree, represented by an $n\times n$ stochastic matrix, we compute the continuous gradient, apply softmax activation and increment the original matrix. In a single step, our gradient finds the correct tree at a distance of 6 subtree-prune-and-regraft moves from the random starting tree. (b) Ultrametric trees of 20 taxa and 100,000 sites are simulated under an LG model of protein evolution. We add random uniform noise to all branch lengths to simulate departures from ultrametricity. Trees are compared to the true tree via Robinson-Foulds distance: light blue bars show midpoint rooting of the best FastME tree and dark blue bars show the root inferred by our approach. (c) Phylogenies for jawed vertebrates, where the number of genes (and hence sites) is reduced to be more clocklike. Normalised Robinson-Foulds distances to the best ASTRAL [Zhang2018-iw] tree are shown for the best unrooted FastME tree after midpoint rooting (light blue) and for our inferred rooting algorithm (dark blue). FastME's performance degrades when the number of sites is small.
  • Figure 2: Phylogenetic inferences of the jawed vertebrates' phylogeny using the two most ultrametric loci from a data set of 99 taxa and 4593 genes [irisarri2017]. (a) Inference using our approach leads to high accuracy in identifying the root and all major jawed vertebrate taxa. Note that we do not estimate branch lengths, but only topology via balanced minimum evolution. (b) Inference using FastME and midpoint rooting leads to widespread error, primarily and critically near the root of the process.
  • Figure 3: An example of the left-to-right construction of the ordered tree $v = [0,0,0,2]$.
  • Figure S1: A simple representation of the swapping property of Queue Shuffle. The swapped subtrees have roots labelled $S_i$ and $S_j$ while the node label $c$ is the subject of the equivalent subtree-prune and regraft operation. Note that in step 4, the internal nodes are renamed to illustrate that the required swap has indeed occurred.
  • Figure S2: Comparison of the best unrooted FastME tree that has been midpoint rooted to an optimised rooted tree via Queue Shuffle on a eutherian mammal dataset [Song2012-ql]. Queue Shuffle correctly places Gallus gallus as the outgroup of mammals. Branch lengths are ignored and trees are displayed as ultrametric.
  • ...and 1 more figures
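
The left-to-right construction in the Figure 3 caption, and the stochastic-matrix view of a tree in Figure 1(a), can be illustrated with a minimal Python sketch. It is not the authors' implementation or the API of any released package, and every function name is hypothetical. It assumes the following conventions: the tree starts as the single leaf 0, entry `v[i]` names an existing leaf in `{0, ..., i}`, and the new leaf `i + 1` is attached to the branch leading to leaf `v[i]`. It also assumes that row `i` of the matrix `W` holds the probabilities that `v[i]` equals each allowed value. The matrix is indexed by the entries of `v`, so its size may differ from the paper's indexing.

```python
import math
import random


def attach(tree, target, new_leaf):
    """Return a copy of `tree` with leaf `target` replaced by the cherry (target, new_leaf)."""
    if isinstance(tree, int):
        return (tree, new_leaf) if tree == target else tree
    left, right = tree
    return (attach(left, target, new_leaf), attach(right, target, new_leaf))


def build_ordered_tree(v):
    """Build a nested-tuple tree from an ordered vector, reading v left to right."""
    tree = 0  # assumed starting point: the single leaf 0
    for i, target in enumerate(v):
        if not 0 <= target <= i:
            raise ValueError(f"v[{i}] = {target} must name an existing leaf in 0..{i}")
        tree = attach(tree, target, i + 1)
    return tree


def to_newick(tree):
    """Serialise a nested-tuple tree as Newick text (topology only, no trailing ';')."""
    if isinstance(tree, int):
        return str(tree)
    return "(" + ",".join(to_newick(child) for child in tree) + ")"


def masked_row_softmax(logits):
    """Row-wise softmax over the allowed entries j <= i; other entries get probability 0."""
    W = []
    for i, row in enumerate(logits):
        allowed = row[: i + 1]
        top = max(allowed)  # subtract the maximum for numerical stability
        weights = [math.exp(x - top) for x in allowed]
        total = sum(weights)
        W.append([w / total for w in weights] + [0.0] * (len(row) - i - 1))
    return W


def sample_ordered_vector(W, rng):
    """Draw each v[i] independently from row i of the stochastic matrix W."""
    return [rng.choices(range(len(row)), weights=row)[0] for row in W]


# The ordered vector from the Figure 3 caption.
print(to_newick(build_ordered_tree([0, 0, 0, 2])) + ";")  # (((0,3),(2,4)),1);

# A uniform distribution over ordered vectors of length 4, and one sampled tree.
W = masked_row_softmax([[0.0] * 4 for _ in range(4)])
v = sample_ordered_vector(W, random.Random(0))
print(v, to_newick(build_ordered_tree(v)) + ";")
```

Under these assumptions each row of `W` is a categorical distribution over attachment points, so a single matrix describes a distribution over ordered trees rather than one tree. The sketch only samples from that distribution; the balanced minimum evolution objective and its gradient are not reproduced here.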

Theorems & Definitions (19)

  • Definition 1
  • Lemma 1
  • Definition 2
  • Lemma 2
  • Definition 3
  • Lemma 3
  • Lemma 4
  • Lemma 5
  • Lemma 6
  • Lemma 7
  • ...and 9 more