PhyloVAE: Unsupervised Learning of Phylogenetic Trees via Variational Autoencoders

Tianyu Xie, Harry Richman, Jiansi Gao, Frederick A. Matsen, Cheng Zhang

TL;DR

PhyloVAE tackles the challenge of representing and generatively modeling discrete phylogenetic tree topologies by introducing a linear-time encoding from trees to vectors and a non-autoregressive variational autoencoder that leverages learnable topological features. It provides a probabilistic framework where a latent variable z with a standard Gaussian prior explains tree topologies through p(s(τ)|z), while q(z|τ) is inferred from topology-aware embeddings, trained with a multi-sample bound L_K using the reparameterization trick. The approach yields a visualization-friendly latent space and enables high-resolution density estimation of tree topologies, outperforming autoregressive baselines in speed and matching or surpassing existing methods in representation quality. Across simulated and real data, PhyloVAE demonstrates robust latent separation of topology shapes, convergence signals across multiple analyses, and scalable generative modeling on benchmark datasets, with practical implications for phylogenetic placement and comparative analyses.
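For orientation, a multi-sample bound of this kind is commonly written in the importance-weighted form below. This is a sketch that assumes $L_K$ follows the standard importance-weighted objective and that $q(z|\tau)$ is a diagonal Gaussian; the parameter symbols $\theta$, $\phi$, the sample index $i$, and the functions $\mu_\phi$, $\sigma_\phi$ are notation introduced here for illustration and may not match the paper.

$$
L_K(\tau) = \mathbb{E}_{z^1,\ldots,z^K \sim q_\phi(z|\tau)} \left[ \log \frac{1}{K} \sum_{i=1}^{K} \frac{p_\theta(\bm{s}(\tau)|z^i)\, p(z^i)}{q_\phi(z^i|\tau)} \right] \le \log p_\theta(\bm{s}(\tau)), \qquad p(z) = \mathcal{N}(0, I).
$$

Under the same assumptions, the reparameterization trick draws each sample as a deterministic function of independent noise, so that gradients can pass through the sampling step:

$$
z^i = \mu_\phi(\tau) + \sigma_\phi(\tau) \odot \epsilon^i, \qquad \epsilon^i \sim \mathcal{N}(0, I), \quad i = 1, \ldots, K.
$$

With $K=1$ this form reduces to the usual evidence lower bound.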

Abstract

Learning informative representations of phylogenetic tree structures is essential for analyzing evolutionary relationships. Classical distance-based methods have been widely used to project phylogenetic trees into Euclidean space, but they are often sensitive to the choice of distance metric and may lack sufficient resolution. In this paper, we introduce phylogenetic variational autoencoders (PhyloVAEs), an unsupervised learning framework designed for representation learning and generative modeling of tree topologies. Leveraging an efficient encoding mechanism inspired by autoregressive tree topology generation, we develop a deep latent-variable generative model that facilitates fast, parallelized topology generation. PhyloVAE combines this generative model with a collaborative inference model based on learnable topological features, allowing for high-resolution representations of phylogenetic tree samples. Extensive experiments demonstrate PhyloVAE's robust representation learning capabilities and fast generation of phylogenetic tree topologies.

Paper Structure

This paper contains 40 sections, 1 theorem, 15 equations, 14 figures, 8 tables, 3 algorithms.

Key Result

Theorem 1

Given a tree topology $\tau$ with $N$ leaf nodes, the time complexity of computing its encoding vector $\bm{s}(\tau)$ is $O(N)$.

Figures (14)

  • Figure 1: An example of a tree topology with six leaf nodes. The labels of leaf nodes are $\{\textrm{A,B,C,D,E,F}\}$.
  • Figure 2: The decomposition loop and reconstruction loop for encoding the tree topology with leaf nodes $\mathcal{X}=\{\textrm{A,B,C,D,E,F}\}$ in Figure 1. Starting from the tree topology in the upper left, we remove the pendant edges $f_6, f_5, f_4$ (associated with the leaf nodes F, E, and D) sequentially and record the edge decisions $e_5, e_4, e_3$. Then, starting from the three-leaf tree topology in the lower right, we add back $f_4, f_5, f_6$ and index the nodes (except for the root) sequentially. The resulting encoding vector is $(3,7,5)$, whose entries are the indexes associated with $e_3, e_4, e_5$.
  • Figure 3: Performance of PhyloVAE for structural representation on simulated data sets. Left: A visualization of the 2D latent manifold for the data set of five-leaf tree topologies. $\Phi(\cdot)$ refers to the cumulative distribution function of the one-dimensional standard Gaussian distribution. Different colors represent the first edge decision, and different transparencies of each color represent the second edge decision. Middle: Representation vectors of all the eight-leaf tree topologies. The scatter size is proportional to the probability of the corresponding tree topology. Right: Pairwise scatter plots, linear regressions, and Pearson correlation coefficients between different distance metrics of tree topologies. $L^2$ = Euclidean distance in PhyloVAE latent space, RF = Robinson-Foulds, PD = Path difference.
  • Figure 4: Performance of PhyloVAE and MDS plots in representing real phylogenies. Left/Middle: Latent representations of the posterior mammal gene trees for five genes with different lengths. The scatter size is proportional to the probability of the tree topology. Right: Latent representations of samples of tree topologies from five independent BEAST runs on the rabies data set.
  • Figure 5: Runtime comparison between ARTree and PhyloVAE ($d=10$) with $K=32$ particles. Training time is measured over 10 training iterations; generation time is measured over generating 100 tree topologies.
  • ...and 9 more figures
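
The reconstruction loop described in the Figure 2 caption can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation. It assumes one particular indexing convention that makes $(3,7,5)$ a valid sequence of decisions: leaves are indexed $1,\ldots,N$ in their given order, each edge is identified by the index of its child node, and the internal node created at each step receives the next index after $N$. The paper's actual convention may differ, and the function names `decode_topology` and `to_newick` are hypothetical.

```python
ROOT = 0  # the central node of the three-leaf starting tree (not indexed as an edge)


def decode_topology(encoding, leaves):
    """Rebuild a topology by attaching leaves 4..N one at a time.

    encoding[k] is the index of the edge (named by its child node) onto which
    leaf number 4 + k is attached. Returns a dict: internal node -> children.
    """
    n_leaves = len(leaves)
    if len(encoding) != n_leaves - 3:
        raise ValueError("encoding must have N - 3 entries")
    parent = {1: ROOT, 2: ROOT, 3: ROOT}
    children = {ROOT: [1, 2, 3]}
    next_internal = n_leaves + 1  # assumed: internal nodes are numbered after the leaves
    for step, edge in enumerate(encoding):
        if edge not in parent:
            raise ValueError(f"edge {edge} does not exist at step {step}")
        new_leaf = 4 + step
        above = parent[edge]
        mid = next_internal
        next_internal += 1
        # Subdivide the edge (above, edge) with a new internal node `mid`
        # and hang the new leaf off it. Every node has at most three
        # children, so each step takes constant time.
        children[above][children[above].index(edge)] = mid
        children[mid] = [edge, new_leaf]
        parent[mid] = above
        parent[edge] = mid
        parent[new_leaf] = mid
    return children


def to_newick(children, leaves, node=ROOT):
    """Render the decoded topology as a Newick-style string without lengths."""
    if node not in children:  # leaf node: indexes 1..N map to the labels
        return leaves[node - 1]
    inner = ",".join(to_newick(children, leaves, c) for c in children[node])
    return "(" + inner + ")"


example = decode_topology([3, 7, 5], "ABCDEF")
print(to_newick(example, "ABCDEF"))
```

Each attachment touches a constant number of dictionary entries, so decoding a vector of length $N-3$ takes $O(N)$ time under this convention, in line with the linear-time encoding stated in Theorem 1. The tree printed here is only what the assumed convention produces and is not claimed to be the topology shown in Figure 1.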

Theorems & Definitions (4)

  • Theorem 1
  • Definition 1: Ordinal Tree Topology
  • Definition 2: Robinson-Foulds distance (Robinson and Foulds, 1981)
  • Definition 3: Path difference distance (Steel and Penny, 1993)
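
The two distances named in Definitions 2 and 3 can be illustrated with a short Python sketch. It uses the common textbook forms: the Robinson-Foulds distance as the number of non-trivial bipartitions found in exactly one of the two trees, and the path difference distance as the Euclidean norm of the difference between the vectors of pairwise leaf-to-leaf path lengths (counted in edges). The paper's definitions may use a different normalization. The nested-tuple tree representation and all function names are assumptions made for this example, and each tree is assumed to be unrooted and written with three subtrees at the top level.

```python
import math
from itertools import combinations


def leaf_set(tree):
    """All leaf labels under a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def splits(tree):
    """Non-trivial bipartitions, each stored as the side without the smallest label."""
    all_leaves = leaf_set(tree)
    ref = min(all_leaves)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(visit(child) for child in node))
        side = all_leaves - clade if ref in clade else clade
        if 2 <= len(side) <= len(all_leaves) - 2:
            found.add(side)
        return clade

    visit(tree)
    return found


def robinson_foulds(t1, t2):
    """Size of the symmetric difference of the two bipartition sets."""
    return len(splits(t1) ^ splits(t2))


def path_lengths(tree):
    """Number of edges on the path between every pair of leaves."""
    dist = {}

    def visit(node):
        if isinstance(node, str):
            return {node: 0}
        # Depth of each leaf measured from the current node.
        per_child = [{leaf: d + 1 for leaf, d in visit(c).items()} for c in node]
        for left, right in combinations(per_child, 2):
            for a, da in left.items():
                for b, db in right.items():
                    dist[frozenset((a, b))] = da + db
        merged = {}
        for depths in per_child:
            merged.update(depths)
        return merged

    visit(tree)
    return dist


def path_difference(t1, t2):
    """Euclidean norm of the difference of the pairwise path-length vectors."""
    d1, d2 = path_lengths(t1), path_lengths(t2)
    return math.sqrt(sum((d1[pair] - d2[pair]) ** 2 for pair in d1))


tree_a = (("A", "B"), "C", ("D", "E"))
tree_b = (("A", "C"), "B", ("D", "E"))
print(robinson_foulds(tree_a, tree_b), path_difference(tree_a, tree_b))
```

For the two five-leaf trees above, the trees share the split separating D and E from the rest and differ in one split each, so the Robinson-Foulds distance is 2. Six of the ten leaf pairs change their path length by one edge, so the path difference distance is $\sqrt{6}$.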