
SDSR: A Spectral Divide-and-Conquer Approach for Species Tree Reconstruction

Ortal Reshef, Ofer Glassman, Or Zuk, Yariv Aizenbud, Boaz Nadler, Ariel Jaffe

TL;DR

Theoretical analysis and empirical evaluation show that combining SDSR with common species tree methods, such as CA-ML or ASTRAL, yields up to 10-fold faster runtimes while achieving tree reconstruction accuracy comparable to that obtained by applying these methods to the full data.

Abstract

Recovering a tree that represents the evolutionary history of a group of species is a key task in phylogenetics. Performing this task using sequence data from multiple genetic markers poses two key challenges. The first is the discordance between the evolutionary history of individual genes and that of the species. The second challenge is computational, as contemporary studies involve thousands of species. Here we present SDSR, a scalable divide-and-conquer approach for species tree reconstruction based on spectral graph theory. The algorithm recursively partitions the species into subsets until their sizes are below a given threshold. The trees of these subsets are reconstructed by a user-chosen species tree algorithm. Finally, these subtrees are merged to form the full tree. On the theoretical front, we derive recovery guarantees for SDSR, under the multispecies coalescent (MSC) model. We also perform a runtime complexity analysis. We show that SDSR, when combined with a species tree reconstruction algorithm as a subroutine, yields substantial runtime savings as compared to applying the same algorithm on the full data. Empirically, we evaluate SDSR on synthetic benchmark datasets with incomplete lineage sorting and horizontal gene transfer. In accordance with our theoretical analysis, the simulations show that combining SDSR with common species tree methods, such as CA-ML or ASTRAL, yields up to 10-fold faster runtimes. In addition, SDSR achieves a comparable tree reconstruction accuracy to that obtained by applying these methods on the full data.
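
The divide step described above can be illustrated with a minimal sketch. It assumes a single symmetric similarity matrix `S` between species, the unnormalized Laplacian, and a split by the sign of the Fiedler vector; the function names `fiedler_split` and `recursive_partition`, the stopping rule `<= threshold`, and the toy matrix are hypothetical and not taken from the paper. Subtree reconstruction and merging are left out.

```python
import numpy as np


def fiedler_split(S):
    """Split indices 0..n-1 in two by the sign of the Fiedler vector."""
    # Unnormalized graph Laplacian L = D - S, with D the diagonal of row sums
    # (an assumption; the paper may use a different normalization).
    L = np.diag(S.sum(axis=1)) - S
    # eigh returns eigenvalues in ascending order for symmetric input,
    # so column 1 is the eigenvector of the second-smallest eigenvalue.
    _, vecs = np.linalg.eigh(L)
    v = vecs[:, 1]
    idx = np.arange(S.shape[0])
    return idx[v >= 0], idx[v < 0]


def recursive_partition(S, species, threshold):
    """Recursively split `species` until each subset has at most `threshold` members."""
    species = np.asarray(species)
    if len(species) <= threshold:
        return [species.tolist()]
    # Restrict the similarity matrix to the current subset before splitting.
    a, b = fiedler_split(S[np.ix_(species, species)])
    if len(a) == 0 or len(b) == 0:
        # Degenerate split: stop rather than recurse forever.
        return [species.tolist()]
    return (recursive_partition(S, species[a], threshold)
            + recursive_partition(S, species[b], threshold))


# Toy similarity matrix: species {0, 1} and {2, 3} form two tight groups.
S = np.array([
    [1.0, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.9],
    [0.1, 0.1, 0.9, 1.0],
])
parts = recursive_partition(S, list(range(4)), threshold=2)
print(sorted(sorted(p) for p in parts))
```

In a full pipeline, each returned subset would be handed to the user-chosen species tree method, and the resulting subtrees would then be merged.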

Paper Structure (45 sections, 20 theorems, 89 equations, 13 figures, 1 algorithm)


Key Result

Theorem 1

Let $\mathcal{T}$ be a binary tree, whose similarity matrix $S$ has all entries between $0$ and $1$. Let $v$ be the Fiedler vector of the Laplacian matrix of $S$. Let $C_1$, $C_2$ be the partition of the terminal nodes of $\mathcal{T}$ according

Figures (13)

  • Figure 1: Discordance between a species tree and a gene tree due to one HGT event (orange) and one ILS event (purple). In HGT, a segment of genetic material is transferred horizontally from a donor species $A$ to a receptor species $C$ [hall2017sampling, brito2021examining]. In ILS, the divergence event associated with species $E$ occurs at an earlier time in the gene tree than in the species tree [degnan2009gene].
  • Figure 2: SDSR workflow: (A) Compute a graph based on the genes' similarity matrices, where nodes represent species. The nodes are colored according to the partitions $C_1, C_2$ from step 1. (B) Form subsets $\tilde{C}_1$ and $\tilde{C}_2$ by adding outgroups $O_2,O_1$ to $C_1$ and $C_2$, respectively, and reconstruct subtrees $\tilde{\mathcal{T}}_1$ and $\tilde{\mathcal{T}}_2$. (C) The roots of the two trees are set to $h_1$ and $h_2$, the nodes adjacent to the outgroups $O_2$ and $O_1$. The outgroup nodes are removed, and the subtrees are merged by connecting $h_1$ to $h_2$.
  • Figure 3: Example of a species tree (blue) and a realization of a gene tree (black). The vertical line is the time axis starting at $\tau=0$ with the current species. The coalescence times in the species tree are $\tau_{ij}$ and $\tau_{ijk}$, whereas those in the gene tree are $\tau_{jk}^g$ and $\tau_{ijk}^g$. In this example, the gene tree and the species tree have different topologies.
  • Figure 4: Partition accuracy as a function of the number of genes, on the $10{,}000$-species datasets. Results shown for the first partition (left) and all partitions (right). The accuracy of the first partition is measured by the Rand Index similarity score to the most similar partition in the ground-truth species tree. For internal partitions, we compared the obtained partition to the most similar partition in the corresponding subtree.
  • Figure 5: Accuracy and runtime as a function of the number of genes for the $200$-species datasets. The upper left panel presents the nRF distance between the estimated and ground-truth species trees for CA-ML and for SDSR using CA-ML as a subroutine. The upper right panel presents the runtime. The bottom panels present the accuracy and runtime of ASTRAL and of SDSR using ASTRAL as a subroutine.
  • ...and 8 more figures
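
Figure 4 scores partitions with the Rand Index. As a minimal sketch of that score alone, the snippet below computes the Rand Index between two labelings of the same items as the fraction of item pairs on which the two labelings agree (both place the pair together, or both place it apart). The function name `rand_index` and the example labels are hypothetical; the search for the most similar partition in the ground-truth tree is not shown.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of item pairs treated the same way by both labelings.

    Requires at least two items and labelings of equal length.
    """
    if len(labels_a) != len(labels_b) or len(labels_a) < 2:
        raise ValueError("need two labelings of equal length >= 2")
    agree = 0
    total = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        # A pair counts as agreement if both labelings group it
        # together or both separate it.
        agree += int(same_a == same_b)
        total += 1
    return agree / total


# Identical partitions up to a renaming of the labels score 1.0.
print(rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
```

The score is invariant to label names, which matters here because the two sides of a spectral split carry no inherent ordering.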

Theorems & Definitions (44)

  • Remark 1
  • Definition 1: Species Tree
  • Theorem 1: [aizenbud2021spectral, Theorem 4.2]
  • Definition 2: The $\mathcal{T}$ rank-1 condition
  • Definition 3: The $\mathcal{T}$ quartet condition
  • Lemma 1
  • Theorem 2
  • proof
  • Lemma 2
  • Remark 2
  • ...and 34 more