PhyloGFN: Phylogenetic inference with generative flow networks

Mingyang Zhou, Zichao Yan, Elliot Layne, Nikolay Malkin, Dinghuai Zhang, Moksh Jain, Mathieu Blanchette, Yoshua Bengio

TL;DR

Generative flow networks (GFlowNets) are applied to two core problems in phylogenetics: parsimony-based and Bayesian phylogenetic inference. The resulting amortized posterior sampler, PhyloGFN, produces diverse and high-quality evolutionary hypotheses on real benchmark datasets.

Abstract

Phylogenetics is a branch of computational biology that studies the evolutionary relationships among biological entities. Its long history and numerous applications notwithstanding, inference of phylogenetic trees from sequence data remains challenging: the high complexity of tree space poses a significant obstacle for the current combinatorial and probabilistic techniques. In this paper, we adopt the framework of generative flow networks (GFlowNets) to tackle two core problems in phylogenetics: parsimony-based and Bayesian phylogenetic inference. Because GFlowNets are well-suited for sampling complex combinatorial structures, they are a natural choice for exploring and sampling from the multimodal posterior distribution over tree topologies and evolutionary distances. We demonstrate that our amortized posterior sampler, PhyloGFN, produces diverse and high-quality evolutionary hypotheses on real benchmark datasets. PhyloGFN is competitive with prior works in marginal likelihood estimation and achieves a closer fit to the target distribution than state-of-the-art variational inference methods. Our code is available at https://github.com/zmy1116/phylogfn.
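
For orientation, the Bayesian target described in the abstract can be written in the standard form below. The symbols are assumptions chosen for illustration: $z$ for a tree topology and $b$ for its branch lengths (echoing the notation in the figure captions), and $Y$ for the observed sequences. The paper's own equations may differ in detail.

$$P(z, b \mid Y) = \frac{P(Y \mid z, b)\, P(z, b)}{P(Y)}, \qquad P(Y) = \sum_{z} \int P(Y \mid z, b)\, P(z, b)\, \mathrm{d}b$$

The numerator is the unnormalized posterior and $P(Y)$ is the marginal likelihood. A GFlowNet is trained to sample objects with probability proportional to a reward, so taking the unnormalized posterior as the reward yields a sampler that approximates $P(z, b \mid Y)$ without computing the sum over all topologies.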

Paper Structure (56 sections, 2 theorems, 20 equations, 12 figures, 11 tables)

This paper contains 56 sections, 2 theorems, 20 equations, 12 figures, 11 tables.

Key Result

Lemma 1

Let $s_1=\{(z_1, b_{1}), (z_2, b_{2}) \dots (z_l,b_{l}) \}$ and $s_2=\{(z'_1, b'_{1}), (z'_2, b'_{2}) \dots (z'_l,b'_{l}) \}$ be two non-terminating states sharing the same features $\rho_{i} =\rho'_i$. Let $a$ be the action that joins the trees with indices $(v,w)$ to form a new tree indexed $u$ with

Figures (12)

  • Figure 1: Left: PhyloGFN's state space on a four-sequence dataset. Initial state $s_0$ comprises leaf nodes. Successive steps merge pairs of trees until a single unrooted tree remains. Right: Policy model for PhyloGFN-Bayesian. Transformer encoder processes tree-level features $s_i=((z_1,b_1) \dots (z_l,b_l))$. Pairwise features ${e_{ij}}$ are derived and used by MLPs to select tree pairs for merging and sample branch lengths.
  • Figure 2: Model sampling log-density vs. unnormalized posterior log-density for high/medium/low-probability trees on DS1. PhyloGFN-Bayesian performs significantly better on medium- and low-probability trees, which points to its advantage in modeling the entire data space.
  • Figure 3: A temperature-conditioned PhyloGFN is trained on DS1 using temperatures sampled between 4.0 and 1.0. (A) Parsimony score distribution with varying statistical temperature input (8.0 to 1.0) to PhyloGFN policy (10,000 trees sampled per temperature). (B) PhyloGFN achieves high Pearson correlation at each temperature.
  • Figure S1: Sampling log-density vs. ground-truth unnormalized posterior log-density for DS2-DS8.
  • Figure S2: (cont.)
  • ...and 7 more figures
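
The bottom-up construction described in the Figure 1 caption (start from leaf nodes, merge pairs of trees until one tree remains) and the parsimony scores referred to in Figure 3 can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the function names are hypothetical, a uniform random choice stands in for the learned policy, parsimony is computed with the classical Fitch algorithm, and the Boltzmann-style weight `exp(-score / temperature)` is only one plausible way a temperature could enter a reward. It is not taken from the PhyloGFN code base.

```python
import math
import random


def fitch_merge(left, right):
    """Fitch step for two subtrees: per-site state sets and the number of added mutations."""
    merged, cost = [], 0
    for a, b in zip(left, right):
        common = a & b
        if common:
            merged.append(common)   # children agree on at least one state
        else:
            merged.append(a | b)    # disagreement: keep the union, count one mutation
            cost += 1
    return merged, cost


def sample_tree(sequences, rng):
    """Build one tree by repeatedly merging two trees of the current forest.

    Each forest entry is (topology string, Fitch sets per site, parsimony score so far).
    The uniform random pair choice is a stand-in for a learned policy (assumption).
    """
    forest = [(name, [frozenset(ch) for ch in seq], 0) for name, seq in sequences.items()]
    while len(forest) > 1:
        i, j = rng.sample(range(len(forest)), 2)
        (topo_a, sets_a, cost_a), (topo_b, sets_b, cost_b) = forest[i], forest[j]
        sets, added = fitch_merge(sets_a, sets_b)
        merged = (f"({topo_a},{topo_b})", sets, cost_a + cost_b + added)
        forest = [t for k, t in enumerate(forest) if k not in (i, j)] + [merged]
    topology, _, score = forest[0]
    return topology, score


def boltzmann_weight(score, temperature):
    """Hypothetical temperature-scaled weight: lower parsimony score gives higher weight."""
    return math.exp(-score / temperature)


if __name__ == "__main__":
    # Toy four-sequence alignment (made up for this example).
    toy = {"A": "AAGT", "B": "AAGT", "C": "CCGT", "D": "CCGA"}
    rng = random.Random(0)
    for _ in range(3):
        topology, score = sample_tree(toy, rng)
        print(topology, score, boltzmann_weight(score, temperature=1.0))
```

Because the Fitch score does not depend on where the tree is rooted, the score accumulated over the merges is the parsimony score of the final unrooted topology. Lowering `temperature` concentrates the weight on the most parsimonious trees, which is the qualitative behavior the Figure 3 caption describes for the temperature-conditioned model.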

Theorems & Definitions (4)

  • Lemma 1
  • proof
  • Lemma 2
  • proof