Vector encoding of phylogenetic trees by ordered leaf attachment

Harry Richman, Cheng Zhang, Frederick A. Matsen

TL;DR

The paper introduces the ordered leaf attachment (OLA) encoding, a linear-time, bijective mapping between rooted, binary, leaf-ordered trees on $n$ leaves and a simple integer-vector space $\mathcal{C}_{n-1}$. It defines the OLA distance as the Hamming distance between encodings and shows that, while a single NNI move can yield a large worst-case distance, the expected distance under random NNI moves is bounded by 4, and it conjectures a height-based bound for SPR moves with supporting empirical evidence. The methodology relies on canonical internal labeling via clade-founder and clade-splitter labels, enabling efficient encode/decode in $O(n)$ time and providing a practical, structure-preserving vector representation suitable for machine learning and tree-space exploration. The approach positions OLA relative to prior encodings (e.g., Phylo2Vec, LTS) while highlighting leaf-ordering dependencies and potential extensions to unrooted trees and branch lengths. Overall, OLA offers a simple, scalable framework for representing and comparing phylogenetic trees in vector form, with implications for learning, sampling, and downstream analyses.

Abstract

As part of work to connect phylogenetics with machine learning, there has been considerable recent interest in vector encodings of phylogenetic trees. We present a simple new "ordered leaf attachment" (OLA) method for uniquely encoding a binary, rooted phylogenetic tree topology as an integer vector. OLA encoding and decoding take linear time in the number of leaf nodes, and the set of vectors corresponding to trees is a simply-described subset of integer sequences. The OLA encoding is unique compared to other existing encodings in having these properties. The integer vector encoding induces a distance on the set of trees, and we investigate this distance in relation to the NNI and SPR distances.
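
The distance induced by the encoding is described in the TL;DR as the Hamming distance between two code vectors. A minimal Python sketch of that definition follows; the function name `ola_distance` is an assumption made for illustration, and the two vectors are the example encodings quoted in the captions of Figures 1 and 2.

```python
def ola_distance(code_a, code_b):
    """Hamming distance: number of positions where two equal-length codes differ."""
    if len(code_a) != len(code_b):
        raise ValueError("codes must describe trees on the same number of leaves")
    return sum(x != y for x, y in zip(code_a, code_b))


# Example vectors taken from the Figure 1 and Figure 2 captions.
yule_code = [0, 0, 2, 1]
other_code = [0, -1, 1, -3]
print(ola_distance(yule_code, other_code))  # the two codes differ in 3 positions
```

Because the comparison is position by position, it only makes sense for trees on the same leaf set with the same leaf ordering.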

Paper Structure

This paper contains 22 sections, 9 theorems, 10 equations, 16 figures, 3 algorithms.

Key Result

Theorem 1

For any $n \geq 2$, the OLA encoding and decoding algorithms define a pair of inverse bijections

Figures (16)

  • Figure 1: OLA-encoding a Yule-type tree: (0,0,2,1). Yule-type trees are those where each new leaf is added as a sister to a previously-added leaf, and for these trees the OLA encoding simply records the label of the sister leaf. At each step, the new leaf is highlighted in blue, and the sister leaf is highlighted in red.
  • Figure 2: OLA-encoding a non-Yule-type tree: (0,-1,1,-3). For non-Yule-type trees, we introduce a canonical labeling of internal nodes, where each internal node is labeled by $-i$ when the $i$-th leaf is added. Using this convention, we construct the OLA encoding by adding the label of the sister node at each step, whether or not that sister is a leaf. At each step, the new leaf is highlighted blue, and the sister node is highlighted red.
  • Figure 3: Two trees which differ by an NNI move, with large OLA distance.
  • Figure 4: Trees differing by an NNI move. For the purposes of the proof of Theorem 3, the left tree is $T$ and the right tree is $T'$.
  • Figure 5: Two trees which differ by an SPR move, with large OLA distance.
  • ...and 11 more figures
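
The attachment rule described in the captions of Figures 1 and 2 can be sketched in Python. This is a minimal illustration under stated assumptions, not the paper's algorithms: it assumes leaves are labeled 0, ..., n-1 in attachment order, that the internal node created when leaf i is attached is labeled -i, and that entry i of the code satisfies |c_i| < i. The names `ola_decode` and `ola_encode`, the nested-tuple tree representation, and the rule used to recover internal labels (the negative of the larger of the two child clades' smallest leaves) are all choices made for this sketch.

```python
def ola_decode(code):
    """Rebuild a tree, as nested 2-tuples of leaf labels, from an OLA-style code."""
    parent = {0: None}   # node label -> parent label (None marks the root)
    kids = {}            # internal label -> [child, child]
    root = 0
    for i, sister in enumerate(code, start=1):
        if not -i < sister < i:
            raise ValueError(f"entry {i} must satisfy |c| < {i}, got {sister}")
        up = parent[sister]
        # Splice a new internal node -i into the edge above the sister.
        kids[-i] = [sister, i]
        parent[-i] = up
        parent[sister] = parent[i] = -i
        if up is None:
            root = -i
        else:
            kids[up][kids[up].index(sister)] = -i

    def nested(node):
        return node if node >= 0 else tuple(nested(c) for c in kids[node])

    return nested(root)


def ola_encode(tree):
    """Recover the code from nested 2-tuples with integer leaves 0..n-1."""
    parent, kids = {}, {}

    def visit(node):
        # Returns (label of this node, smallest leaf in its clade).
        if isinstance(node, int):
            return node, node
        (la, ma), (lb, mb) = visit(node[0]), visit(node[1])
        label = -max(ma, mb)          # assumed canonical internal label
        kids[label] = [la, lb]
        parent[la] = parent[lb] = label
        return label, min(ma, mb)

    root, _ = visit(tree)
    parent[root] = None
    code = []
    for i in range(len(kids), 0, -1):  # remove leaves n-1, ..., 1 in turn
        p = parent[i]
        a, b = kids[p]
        sister = b if a == i else a
        code.append(sister)
        # Splice out the parent of leaf i so the sister takes its place.
        up = parent[p]
        parent[sister] = up
        if up is not None:
            kids[up][kids[up].index(p)] = sister
    return code[::-1]


print(ola_decode([0, 0, 2, 1]))                  # ((0, (2, 3)), (1, 4))
print(ola_encode(((0, ((1, 3), 4)), 2)))         # [0, -1, 1, -3]
```

Under these assumptions, the vector (0,0,2,1) from the Figure 1 caption decodes to a tree in which every new leaf is attached next to an existing leaf, while the vector (0,-1,1,-3) from the Figure 2 caption uses negative entries to attach leaves next to internal nodes.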

Theorems & Definitions (20)

  • Theorem 1
  • Definition 2
  • Theorem 3
  • proof: Proof of Theorem 3
  • Definition 4
  • Conjecture 5
  • Proposition 6
  • proof
  • Theorem 7
  • proof
  • ...and 10 more