PhyloGen: Language Model-Enhanced Phylogenetic Inference via Graph Structure Generation

ChenRui Duan, Zelin Zang, Siyuan Li, Yongjie Xu, Stan Z. Li

TL;DR

PhyloGen addresses the challenge of jointly inferring phylogenetic topology and branch lengths by using a pre-trained genomic language model to generate informative genome embeddings and latent representations. It frames inference as a conditionally constrained tree structure generation problem and integrates three modules (Feature Extraction, PhyloTree Construction, and PhyloTree Structure Modeling) together with a novel Scoring Function that stabilizes gradient descent. Trained end to end with SGD under variational inference with a multi-sample ELBO, PhyloGen achieves state-of-the-art ELBO and marginal log-likelihood (MLL) on eight benchmark datasets, while producing diverse, topology-consistent trees that align with gold-standard MrBayes bipartitions. The approach removes the need for aligned sequences or predefined evolutionary models, which allows analysis of variable-length genomic data and may extend to other biological data modalities.
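
As a rough illustration of the multi-sample ELBO mentioned above, the sketch below computes the generic K-sample (importance-weighted) lower bound, $\log \frac{1}{K}\sum_{k=1}^{K} \frac{p(Y, \tau_k, B_k)}{q(\tau_k, B_k)}$, from per-sample log-densities. The function name and its arguments (`log_joint`, `log_q`) are hypothetical, and this summary does not state PhyloGen's exact objective, so treat it as the standard estimator under those assumptions rather than the authors' implementation.

```python
import math


def multi_sample_elbo(log_joint, log_q):
    """Estimate the K-sample lower bound from K sampled (tree, branch-length) pairs.

    log_joint[k]: log p(Y, tau_k, B_k) for the k-th sample (assumed given).
    log_q[k]:     log q(tau_k, B_k) under the variational distribution.
    """
    if not log_joint or len(log_joint) != len(log_q):
        raise ValueError("need two non-empty lists of equal length")
    # Log importance weights: log p(Y, tau_k, B_k) - log q(tau_k, B_k).
    log_w = [lj - lq for lj, lq in zip(log_joint, log_q)]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    m = max(log_w)
    log_sum = m + math.log(sum(math.exp(w - m) for w in log_w))
    # Average over the K samples inside the logarithm.
    return log_sum - math.log(len(log_w))


# With K = 1 the estimate reduces to the ordinary single-sample ELBO term.
single = multi_sample_elbo([-10.0], [-2.0])
# With K > 1 the estimate is never below the mean of the log weights.
multi = multi_sample_elbo([-10.0, -12.0, -9.5], [-2.0, -2.5, -1.0])
```

By Jensen's inequality the K-sample estimate is at least the average of the individual log weights, which is why larger K typically gives a tighter bound.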

Abstract

Phylogenetic trees elucidate evolutionary relationships among species, but phylogenetic inference remains challenging due to the complexity of combining continuous (branch lengths) and discrete parameters (tree topology). Traditional Markov Chain Monte Carlo methods face slow convergence and computational burdens. Existing Variational Inference methods, which require pre-generated topologies and typically treat tree structures and branch lengths independently, may overlook critical sequence features, limiting their accuracy and flexibility. We propose PhyloGen, a novel method leveraging a pre-trained genomic language model to generate and optimize phylogenetic trees without dependence on evolutionary models or aligned sequence constraints. PhyloGen views phylogenetic inference as a conditionally constrained tree structure generation problem, jointly optimizing tree topology and branch lengths through three core modules: (i) Feature Extraction, (ii) PhyloTree Construction, and (iii) PhyloTree Structure Modeling. Meanwhile, we introduce a Scoring Function to guide the model towards a more stable gradient descent. We demonstrate the effectiveness and robustness of PhyloGen on eight real-world benchmark datasets. Visualization results confirm PhyloGen provides deeper insights into phylogenetic relationships.

Paper Structure

This paper contains 41 sections, 33 equations, 13 figures, 9 tables, 1 algorithm.

Figures (13)

  • Figure 1: Comparison of Phylogenetic Tree Inference Methods. (a) The inputs are aligned sequences, and topologies are learned from existing tree structures using methods like SBNs, which rely on MCMC-based methods for pre-generated candidate trees without considering branch lengths directly. (b) The inputs are aligned sequences, and tree structures and branch lengths are then inferred directly by variational inference and biological modules. These methods optimize tree topology and branch lengths separately. (c) The inputs are raw sequences processed by a pre-trained language model to generate species representations. An initial topology is then generated by a tree construction module, and the topology and branch lengths are co-optimized by the tree structure modeling module.
  • Figure 2: Framework of PhyloGen. A. The Feature Extraction module extracts genome embeddings $E$ from raw sequences $Y$ using a pre-trained language model. B. The PhyloTree Construction module uses $E$ to compute topological parameters, which generate an initial tree structure $\tau^*$ via the Neighbor-Joining algorithm. C. The PhyloTree Structure Modeling module jointly models $\tau$ and $B_\tau$ through the topology learning component (TreeEncoder $R$ and TreeDecoder $Q$) and the branch length (Blens) learning component (dual-pass traversal, DGCNN network, Blens reparameterization).
  • Figure 3: Comparison of ELBO and Scoring Function over Training Steps on DS1. Closer curves are better.
  • Figure 3: Diversity of tree topologies.
  • Figure 4: Comparison of ELBO and MLL Metrics for DS1 Dataset with Different Baselines.
  • ...and 8 more figures
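
The Figure 2 caption says an initial tree is generated via the Neighbor-Joining algorithm. A minimal sketch of textbook Neighbor-Joining follows. It assumes, purely for illustration, that the input distances are Euclidean distances between embedding vectors; the source does not specify how PhyloGen derives its topological parameters from $E$. The function names and the edge-list output format are hypothetical.

```python
import math
from itertools import combinations


def pairwise_distances(embeddings):
    """Euclidean distance matrix between embedding vectors (an assumed choice)."""
    n = len(embeddings)
    return [[math.dist(embeddings[i], embeddings[j]) for j in range(n)]
            for i in range(n)]


def neighbor_joining(dist):
    """Textbook Neighbor-Joining on a symmetric distance matrix.

    Returns an unrooted tree as (node_a, node_b, branch_length) edges.
    Leaves keep ids 0..n-1; internal nodes get ids n, n+1, ...
    """
    n_leaves = len(dist)
    if n_leaves < 2:
        raise ValueError("need at least two taxa")
    # Nested dict of current distances, keyed by node id.
    d = {i: {j: float(dist[i][j]) for j in range(n_leaves) if j != i}
         for i in range(n_leaves)}
    active = list(range(n_leaves))
    edges = []
    next_id = n_leaves
    while len(active) > 2:
        n = len(active)
        # Row sums over the currently active nodes.
        r = {a: sum(d[a][b] for b in active if b != a) for a in active}
        # Pick the pair that minimises the Q criterion.
        i, j = min(combinations(active, 2),
                   key=lambda p: (n - 2) * d[p[0]][p[1]] - r[p[0]] - r[p[1]])
        u = next_id
        next_id += 1
        # Branch lengths from the joined pair to the new internal node.
        len_i = 0.5 * d[i][j] + (r[i] - r[j]) / (2 * (n - 2))
        len_j = d[i][j] - len_i
        edges.append((u, i, len_i))
        edges.append((u, j, len_j))
        # Distances from the new node to every remaining node.
        d[u] = {}
        for k in active:
            if k not in (i, j):
                d[u][k] = d[k][u] = 0.5 * (d[i][k] + d[j][k] - d[i][j])
        active = [k for k in active if k not in (i, j)] + [u]
    a, b = active
    edges.append((a, b, d[a][b]))
    return edges


# Toy usage with made-up 2-D "embeddings" for four taxa.
toy_embeddings = [[0.0, 0.0], [0.1, 0.2], [3.0, 3.1], [3.2, 2.9]]
toy_tree = neighbor_joining(pairwise_distances(toy_embeddings))
```

Neighbor-Joining recovers the true branch lengths when the distances are exactly additive on a tree. With embedding-derived distances that property is not guaranteed, so some branch lengths may come out negative and would need clamping or later refinement.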