Vector encoding of phylogenetic trees by ordered leaf attachment
Harry Richman, Cheng Zhang, Frederick A. Matsen
TL;DR
The paper introduces the ordered leaf attachment (OLA) encoding, a linear-time, bijective mapping between rooted, binary, leaf-ordered trees on $n$ leaves and a simple integer-vector space $\mathcal{C}_{n-1}$. It defines the OLA distance as the Hamming distance between encodings and shows that, while a single NNI move can yield a large worst-case distance, the expected distance under random NNI moves is bounded by 4. It also conjectures a height-based bound for SPR moves, with supporting empirical evidence. The method relies on canonical internal labeling via clade-founder and clade-splitter labels, which enables encoding and decoding in $O(n)$ time and gives a practical, structure-preserving vector representation suitable for machine learning and tree-space exploration. The paper positions OLA relative to prior encodings (e.g., Phylo2Vec, LTS), notes its dependence on the leaf ordering, and discusses potential extensions to unrooted trees and branch lengths. Overall, OLA offers a simple, scalable framework for representing and comparing phylogenetic trees in vector form, with implications for learning, sampling, and downstream analyses.
Abstract
As part of work to connect phylogenetics with machine learning, there has been considerable recent interest in vector encodings of phylogenetic trees. We present a simple new "ordered leaf attachment" (OLA) method for uniquely encoding a binary, rooted phylogenetic tree topology as an integer vector. OLA encoding and decoding take linear time in the number of leaf nodes, and the set of vectors corresponding to trees is a simply-described subset of integer sequences. Among existing encodings, only OLA has these properties. The integer vector encoding induces a distance on the set of trees, and we investigate this distance in relation to the NNI and SPR distances.
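The distance induced by the encoding is the Hamming distance between two integer vectors, that is, the number of positions at which they differ. The following is a minimal sketch of that computation only. The function name `hamming_distance` and the two example vectors are illustrative assumptions; the vectors are placeholders and are not claimed to be valid OLA codes of any particular trees.

```python
from typing import Sequence


def hamming_distance(u: Sequence[int], v: Sequence[int]) -> int:
    """Count the positions at which two equal-length integer vectors differ."""
    if len(u) != len(v):
        # Codes of trees on the same number of leaves have the same length,
        # so a length mismatch indicates incomparable inputs.
        raise ValueError("vectors must have the same length")
    return sum(a != b for a, b in zip(u, v))


# Hypothetical placeholder vectors (not derived from actual trees).
code_a = [0, -1, 2, 0]
code_b = [0, 1, 2, -3]

# The vectors differ at index 1 and index 3.
print(hamming_distance(code_a, code_b))  # prints 2
```

Because the comparison is a single pass over the two vectors, its cost is linear in the vector length, in keeping with the linear-time encoding and decoding described above.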
