
Phylo2Vec: a vector representation for binary trees

Matthew J Penn, Neil Scheidwasser, Mark P Khurana, David A Duchêne, Christl A Donnelly, Samir Bhatt

TL;DR

Phylo2Vec introduces a bijective, compact encoding that maps a binary rooted tree with $n$ leaves to a unique integer vector $\boldsymbol{v}$ of length $n-1$, where each $v_j \in \{0,\ldots,2(j-1)\}$. This representation enables fast tree sampling, unambiguous topology verification, and a natural Hamming-based distance between trees, supporting systematic exploration of tree space beyond traditional SPR heuristics. The authors prove bijectivity and demonstrate the utility of Phylo2Vec through a hill-climbing maximum-likelihood inference on five empirical datasets, achieving consistent optima from random starts and showing substantial gains in speed and storage over standard formats. The approach promises integration with broader ML and Bayesian frameworks, with potential extensions to gradient-based optimization, MCTS, and irreversible-model phylogenetics, offering a scalable path for complex phylogenetic inference.

Abstract

Binary phylogenetic trees inferred from biological data are central to understanding the shared history among evolutionary units. However, inferring the placement of latent nodes in a tree is computationally expensive. State-of-the-art methods rely on carefully designed heuristics for tree search, using different data structures for easy manipulation (e.g., classes in object-oriented programming languages) and readable representation of trees (e.g., Newick-format strings). Here, we present Phylo2Vec, a parsimonious encoding for phylogenetic trees that serves as a unified approach for both manipulating and representing phylogenetic trees. Phylo2Vec maps any binary tree with $n$ leaves to a unique integer vector of length $n-1$. The advantages of Phylo2Vec are fourfold: (i) fast tree sampling, (ii) compressed tree representation compared to a Newick string, (iii) quick and unambiguous verification of whether two binary trees are topologically identical, and (iv) systematic ability to traverse tree space in very large or small jumps. As a proof of concept, we use Phylo2Vec for maximum likelihood inference on five real-world datasets and show that a simple hill-climbing-based optimisation scheme can efficiently traverse the vastness of tree space from a random to an optimal tree.
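
To make the sampling and comparison claims concrete, here is a minimal Python sketch. It assumes only the entry ranges stated in the TL;DR ($v_j \in \{0,\ldots,2(j-1)\}$ for $j = 1,\ldots,n-1$); the function names `sample_vector`, `is_valid` and `hamming` are hypothetical and are not taken from the authors' code.

```python
import random


def sample_vector(n_leaves, rng=random):
    """Draw one vector of length n_leaves - 1.

    Entry j (1-based) is uniform on {0, ..., 2(j-1)}, so sampling
    needs no tree construction at all.
    """
    return [rng.randint(0, 2 * (j - 1)) for j in range(1, n_leaves)]


def is_valid(v):
    """Check the range constraint; with 0-based index i, the bound is 2*i."""
    return all(isinstance(x, int) and 0 <= x <= 2 * i for i, x in enumerate(v))


def hamming(v, w):
    """Number of positions at which two equal-length vectors differ."""
    if len(v) != len(w):
        raise ValueError("vectors must describe trees with the same number of leaves")
    return sum(a != b for a, b in zip(v, w))


if __name__ == "__main__":
    v = sample_vector(6, random.Random(0))
    w = sample_vector(6, random.Random(1))
    print(v, w, hamming(v, w))
```

Because the encoding is a bijection, two trees on the same labelled leaves are topologically identical exactly when their vectors are equal, so `hamming(v, w) == 0` doubles as the identity check.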

Paper Structure (21 sections, 1 theorem, 7 equations, 9 figures, 1 table, 5 algorithms)

Key Result

Lemma 1

The mapping between the set of vectors $\boldsymbol{v}\in \mathbb{V}$ and the set of (topologically equivalent, labelled) trees is a bijection.
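
A quick counting check is consistent with this result, though it is not a proof. Assuming $\mathbb{V}$ is the set of vectors with the entry ranges given in the TL;DR, so that entry $v_j$ can take $2(j-1)+1 = 2j-1$ values, the number of vectors is

$$
|\mathbb{V}| = \prod_{j=1}^{n-1} (2j-1) = 1 \cdot 3 \cdot 5 \cdots (2n-3) = (2n-3)!!
$$

which equals the number of rooted binary trees on $n$ labelled leaves. For $n = 4$ this gives $1 \cdot 3 \cdot 5 = 15$ vectors and 15 trees.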

Figures (9)

  • Figure 1: An incomplete integer representation of tree topology as birth processes. (a) Labelling a tree as an ordered vector: example for $\boldsymbol{v} = [0, 0, 0]$. We process leaves in ascending order. For each leaf $j$, we retrieve its sibling (or adjacent tip) in the Newick string, ignoring leaves with labels greater than $j$. The adjacent tip corresponds to $\boldsymbol{v}[j]$. (b) Recovering a tree from an ordered vector: example for $\boldsymbol{v} = [0, 0, 1]$. We process $\boldsymbol{v}$ from left to right. Ancestors are named in last-in-first-out (LIFO) fashion: the ancestor of the last added leaf $L-1$ (here, leaf 3) is named $L$ (here, 4), the ancestor of the second-to-last added leaf $L-2$ (here, leaf 2) is named $L+1$ (here, 5), etc. In both cases, the lengths of the edges are arbitrary.
  • Figure 2: Recovering a tree from a Phylo2Vec vector: example for $\boldsymbol{v} = [0, 2, 2, 5, 2]$. (a) Main algorithm for leaf placement. Initially, we consider a tree with two leaves labelled 0 and 1 and an extra temporary root, which ensures that there are $2j - 1$ entries for any position $j \in \{1, \ldots, n-1\}$. This state corresponds to $\boldsymbol{v} = [0]$. We then process $\boldsymbol{v}$ from left to right. $\boldsymbol{v}[j]$ denotes the branch to be split, yielding a new leaf $j$. At the end of each iteration, a branch labelling step described in (b) is performed. First, branches leading to leaves $0, \ldots, n-1$ are labelled $0, \ldots, n-1$, respectively. Second, the temporary root is always labelled as $2(n-1)$. Third, for internal branches, the next branch ($n$) to label is the branch of a cherry with the highest label $c_{max}$. We then prune out the leaf $c_{max}$, and repeat the same process for internal branches $n+1, \ldots, 2(n-1) - 1$. See Figure \ref{fig:cpxity_to_newick} for more details about complexity.
  • Figure 3: Example of trees with $n=4$ leaves represented in both Newick and Phylo2Vec vector formats. Nodes 0-3 and 4-6 respectively denote the leaves and internal nodes.
  • Figure 5: Comparison of Phylo2Vec moves with three popular tree distances: subtree-prune-and-regraft (SPR; left), Robinson-Foulds (RF; middle), and Kuhner-Felsenstein (KF; right). To generate the distances, a random walk of 5000 steps was performed from a random initial $\boldsymbol{v}$ with 200 taxa. At each step, each $v_i$ can increment, decrement or remain unchanged.
  • Figure 6: Example of a reordering scheme of $\boldsymbol{v}$ using level-order traversal. Starting from the root, for each level, we relabel the immediately descending leaf nodes with the smallest integers available (from 0 to $n-1$). The letters (a-g) indicate the taxa, showing that reordering the leaves does not affect tree topology but simply changes the integer-taxon mapping.
  • ...and 4 more figures
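
The leaf-placement and branch-labelling steps in the Figure 2 caption can be sketched in Python as follows. This is one reading of the caption, not the authors' implementation: it assumes that "the branch of a cherry" means the branch above the cherry's parent node, and that pruning $c_{max}$ leaves the sibling leaf standing in for that parent. Internal nodes get arbitrary negative ids, all function names are hypothetical, and the labelling is recomputed naively at every step for clarity rather than speed.

```python
from itertools import product


def label_branches(children, n_leaves):
    """Map each branch label to the node directly below that branch."""
    labels = {leaf: leaf for leaf in range(n_leaves)}  # leaf branches keep leaf labels
    rep = {leaf: leaf for leaf in range(n_leaves)}     # leaf standing in for a pruned subtree
    unlabelled = set(children)
    next_label = n_leaves
    while unlabelled:
        # A cherry is an internal node whose two children have both collapsed to leaves.
        cherries = [p for p in unlabelled if all(c in rep for c in children[p])]
        best = max(cherries, key=lambda p: max(rep[c] for c in children[p]))
        labels[next_label] = best
        rep[best] = min(rep[c] for c in children[best])  # prune c_max, keep its sibling
        unlabelled.remove(best)
        next_label += 1
    return labels  # the root's branch is labelled last, i.e. 2 * (n_leaves - 1)


def build_tree(v):
    """Return (children, root) for a vector v, placing leaves 2, 3, ... in turn."""
    if not v or v[0] != 0:
        raise ValueError("the first entry must be 0 (two-leaf starting tree)")
    children = {-1: (0, 1)}
    parent = {0: -1, 1: -1, -1: None}
    root = -1
    for j in range(2, len(v) + 1):
        branch = v[j - 1]
        if not 0 <= branch <= 2 * (j - 1):
            raise ValueError(f"entry for leaf {j} is out of range")
        target = label_branches(children, j)[branch]
        new, old_parent = -j, parent[target]  # new internal node splits the branch
        children[new] = (target, j)
        parent[target], parent[j], parent[new] = new, new, old_parent
        if old_parent is None:
            root = new
        else:
            a, b = children[old_parent]
            children[old_parent] = (new, b) if a == target else (a, new)
    return children, root


def to_newick(v):
    """Topology-only Newick string with children sorted, so equal trees match."""
    children, root = build_tree(v)

    def render(node):
        if node not in children:
            return str(node)
        return "(" + ",".join(sorted(render(c) for c in children[node])) + ")"

    return render(root) + ";"


def all_vectors(n_leaves):
    """Enumerate every vector whose 0-based entry i lies in {0, ..., 2*i}."""
    return [list(v) for v in product(*(range(2 * i + 1) for i in range(n_leaves - 1)))]


if __name__ == "__main__":
    print(to_newick([0, 2, 2, 5, 2]))
    print(len({to_newick(v) for v in all_vectors(4)}))  # distinct topologies for n = 4
```

Whatever the exact labelling convention, any rule that assigns the $2j-1$ labels to the $2j-1$ branches one-to-one yields distinct trees for distinct vectors, which is why enumerating `all_vectors(4)` is a useful sanity check on a sketch like this.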

Theorems & Definitions (1)

  • Lemma 1