A Vector Representation for Phylogenetic Trees

Cedric Chauve; Caroline Colijn; Louxin Zhang

TL;DR

The paper tackles the challenge of efficiently representing and comparing large phylogenetic trees. It introduces a vector encoding of rooted binary trees on $n$ taxa as a length-$2n$ sequence with each taxon appearing twice, and defines a novel HOP rearrangement operator whose distance can be computed in near-linear time using a Longest Common Subsequence variant. The approach yields a one-to-one tree–vector correspondence, demonstrates storage advantages over Newick, and shows the HOP distance tracks SPR-based dissimilarity more closely than RF, with extensions to polytomies and tree-child networks. The work suggests practical benefits for scalable tree inference, sampling, and potentially accelerating machine learning-based phylogenetic methods, while highlighting ordering-dependence and avenues for mHOP-based improvement.

Abstract

Good representations for phylogenetic trees and networks are important for optimizing storage efficiency and implementation of scalable methods for the inference and analysis of evolutionary trees for genes, genomes and species. We introduce a new representation for rooted phylogenetic trees that encodes a binary tree on $n$ taxa as a vector of length $2n$ in which each taxon appears exactly twice. Using this new tree representation, we introduce a novel tree rearrangement operator, called a HOP, that results in a tree space of diameter $n$ and a quadratic neighbourhood size. We also introduce a novel metric, the HOP distance, which is the minimum number of HOPs needed to transform one tree into another. The HOP distance can be computed in near-linear time, a rare instance of a tree rearrangement distance that is tractable. Our experiments show that the HOP distance is better correlated with the Subtree-Prune-and-Regraft distance than the widely used Robinson-Foulds distance is. We also describe how the novel tree representation we introduce can be further generalized to tree-child networks.
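As a rough illustration of the shape of the encoding, the sketch below checks only the two properties stated in the abstract: the vector has length $2n$ and each taxon in $\{1,\dots,n\}$ appears exactly twice. This is a necessary condition, not the paper's full definition of a tree representation, which imposes further constraints not reproduced here. The function name `has_basic_shape` and the sample vectors are hypothetical, chosen for illustration.

```python
from collections import Counter


def has_basic_shape(vector, n):
    """Check the necessary conditions stated in the abstract.

    Assumption: taxa are labelled 1..n. Passing this check does NOT
    guarantee that the vector encodes a tree; the paper's definition
    adds ordering constraints that are not modelled here.
    """
    if len(vector) != 2 * n:
        return False
    counts = Counter(vector)
    # Every taxon 1..n must occur exactly twice; with length 2n this
    # also rules out any label outside 1..n.
    return all(counts.get(taxon, 0) == 2 for taxon in range(1, n + 1))


if __name__ == "__main__":
    print(has_basic_shape([1, 2, 1, 2], 2))  # length 4, each taxon twice
    print(has_basic_shape([1, 1, 1, 2], 2))  # taxon 1 appears three times
```

A check like this could serve as a cheap sanity filter before decoding, though a real validator would need the complete definition from the paper.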

Paper Structure (18 sections, 6 theorems, 3 equations, 9 figures, 1 table)

Introduction
Preliminaries
Phylogenetic trees.
Phylogenetic networks.
Vectors and Longest Common Subsequence.
Results
Vector representation of rooted phylogenetic trees
The HOP tree rearrangement operator
The HOP distance for phylogenetic trees
Generalizations
Trees with polytomies
Tree-child networks
Experiments
Encoding phylogenetic trees.
HOP neighbourhood size.
...and 3 more sections

Key Result

Theorem 1

There is a one-to-one correspondence between tree representation vectors on $X=\{1,2,\dots,n\}$ and rooted phylogenetic trees on $X$.

Figures (9)

Figure 1: The fifteen trees on $X=\{1, 2, 3, 4\}$ and their representations. Here, commas are omitted in the tree representations, in which the second occurrences of the taxa are colored and underlined.
Figure 2: An illustration of the encoding of a phylogenetic tree into a tree representation. (a) A phylogenetic tree $T$ on $X=\{1, 2, 3, 4, 5\}$. (b) The labeling of internal nodes (step 2 of the encoding algorithm). (c) The decomposition of the tree into the paths from the non-leaf node labelled $i$ to the leaf $i$. (d) The tree representation $\mathbf{v}(T)$, where the second copy of each taxon is underlined.
Figure 3: An illustration of the SPR operator.
Figure 4: An illustration of the HOP operator that moves $3$ before $\underline{2}$, and its effect as an SPR rearrangement. The only other HOP rearrangement moving $3$ would move it before $\underline{1}$.
Figure 5: Illustration of the procedure of encoding tree-child networks. The five tree components of the network are highlighted in blue and grey.
...and 4 more figures
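The count of fifteen trees in the Figure 1 caption agrees with the standard formula for the number of rooted binary phylogenetic trees on $n$ labelled taxa, $(2n-3)!!$, which gives $5!! = 1 \cdot 3 \cdot 5 = 15$ for $n=4$. The sketch below evaluates that formula; the function name `count_rooted_binary_trees` is hypothetical, and the formula is a well-known combinatorial fact rather than something taken from this page.

```python
def count_rooted_binary_trees(n):
    """Return (2n-3)!!, the number of rooted binary trees on n labelled taxa.

    Assumption: n >= 1, with the usual convention that the count is 1
    for n = 1 and n = 2.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n-3.
    for odd in range(3, 2 * n - 2, 2):
        count *= odd
    return count


if __name__ == "__main__":
    print(count_rooted_binary_trees(4))  # 15, matching Figure 1
```

By Theorem 1, the same number counts the tree representation vectors on $X=\{1,2,\dots,n\}$.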

Theorems & Definitions (11)

Definition 1
Theorem 1
Definition 2
Definition 3
Proposition 1
Definition 4
Definition 5
Theorem 2
Proposition 2
Proposition 3
...and 1 more
