Improving Gene Trees without more data

Ashu Gupta

TL;DR

The study tackles the challenge of estimating gene and species trees when loci have low phylogenetic signal and incomplete lineage sorting (ILS) confounds species-tree inference. It introduces WSB+WQMC, a pipeline that bins genes as in weighted statistical binning, pools weighted quartets within each bin, and runs WQMC to produce a refined tree for each gene; the pipeline is statistically consistent under the $GTR+MSC$ model. In BestML analyses on diverse simulated datasets, WSB+WQMC substantially improves gene-tree and species-tree accuracy on most datasets with low to moderately high ILS. Compared with WSB+CAML, it is less accurate under certain low and medium ILS conditions but at least as accurate on all datasets with moderately high and high ILS. The results emphasize that binning-based strategies can boost phylogenetic signal in favorable regimes but require careful parameter tuning and further refinement to reliably handle high-ILS scenarios and real data.

Abstract

Estimating species and gene trees from sequence data is challenging. Gene tree estimation is often hampered by low phylogenetic signal in alignments, leading to inaccurate trees. Species tree estimation is complicated by incomplete lineage sorting (ILS), where gene histories differ from the species' history. Summary methods such as MP-EST, ASTRAL2, and ASTRID infer species trees from gene trees but lose accuracy when gene tree accuracy is low. To address this, the Statistical Binning (SB) and Weighted Statistical Binning (WSB) pipelines were developed to improve gene tree estimation. However, previous studies only tested these pipelines using multi-locus bootstrapping (MLBS), not the BestML approach. This thesis proposes a novel pipeline, WSB+WQMC, which shares design features with the existing WSB+CAML pipeline but has other desirable properties and is statistically consistent under the GTR+MSC model. This study evaluated WSB+WQMC against WSB+CAML using BestML analysis on various simulated datasets. The results confirmed many trends seen in prior MLBS analyses. WSB+WQMC substantially improved gene tree and species tree accuracy (using ASTRAL2 and ASTRID) on most datasets with low, medium, and moderately high ILS levels. In a direct comparison, WSB+WQMC computed less accurate trees than WSB+CAML under certain low and medium ILS conditions. However, WSB+WQMC was at least as accurate as WSB+CAML on all datasets with moderately high and high ILS. It also proved better for estimating gene trees on some medium and low ILS datasets. Thus, WSB+WQMC is a promising alternative to WSB+CAML for phylogenetic estimation, especially in the presence of low phylogenetic signal.

Paper Structure

This paper contains 44 sections, 25 figures, 4 tables.

Figures (25)

  • Figure 1: Basic unbinned phylogenetic pipeline using a summary method. The input to the pipeline is a set of sequences for different loci across different species. In this pipeline, a multiple sequence alignment is computed using any alignment method. Then a gene tree is computed on each gene using the multiple sequence alignment. Gene trees are then used by a summary method to compute a species tree.
  • Figure 2: WSB+CAML pipeline for phylogenetic analysis. The input to the pipeline is a set of gene alignments computed from a set of sequences for different loci across different species. Gene trees are then estimated on the gene alignments using a maximum likelihood tree estimation method. The gene trees are used to compute an incompatibility graph, in which each vertex represents a gene and each edge indicates that two genes are incompatible at binning threshold $t$. A heuristic for balanced minimum vertex coloring is run to divide the genes into disjoint bins. For each bin, the alignments of the genes in that bin are concatenated into a supergene alignment. Supergene trees are then estimated using a fully partitioned maximum likelihood tree estimation method. The supergene tree for each bin is replicated once per gene in that bin, and each copy is treated as the new gene tree for that gene. The new gene trees are used by a summary method to compute a species tree. Note that the example used to describe the WSB+CAML pipeline is taken from Figure 1 of bayzid2015weighted.
  • Figure 3: WSB+WQMC pipeline for phylogenetic analysis. The input to the pipeline is a set of gene alignments computed from a set of sequences for different loci across different species. Gene trees are then estimated on the gene alignments using a maximum likelihood tree estimation method. The genes are then divided into disjoint bins using the same binning technique as WSB+CAML, with binning threshold $t$. For each gene, a set of weighted quartets is computed by combining the weighted quartet topologies of the genes in its bin, with the gene's own quartets up-weighted by confidence value $c$. WQMC is then run on this weighted quartet set to obtain a new gene tree for each gene. The new gene trees are then used by a summary method to compute the species tree.
  • Figure 4: Gene tree estimation error on the Mammalian datasets for running WSB+WQMC with different confidence values and binning threshold 75%. We show results on the Mammalian-0.5X (50% AD), Mammalian-1X (30% AD), and Mammalian-2X (21% AD) datasets with 200 genes and 500 bp alignment length. The x-axis shows three different ways of running WSB+WQMC by varying confidence value $c$ (0.0, 0.2, and 0.3). The accuracy of WSB+WQMC gene trees is also compared to the accuracy of original gene trees. Average FN rate is shown with standard error bars over 10 replicates.
  • Figure 5: Gene tree estimation error on the Mammalian datasets for running WSB+WQMC with binning threshold 75% and binning threshold 100% using confidence value 0.2. We show results on the Mammalian-0.5X (50% AD), Mammalian-1X (30% AD), and Mammalian-2X (21% AD) datasets with 200 genes and 500 bp alignment length. The accuracy of WSB+WQMC gene trees is also compared to the accuracy of original gene trees. Average FN rate is shown with standard error bars over 10 replicates.
  • ...and 20 more figures
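
The binning step shared by WSB+CAML and WSB+WQMC (Figures 2 and 3) can be illustrated with a minimal sketch. It assumes the incompatibility graph has already been computed, that is, the list of gene pairs judged incompatible at threshold $t$ is given as input. The function name `bin_genes` and the simple greedy rule below are hypothetical stand-ins chosen for illustration; they are not the balanced vertex coloring heuristic used by the authors.

```python
from collections import defaultdict


def bin_genes(genes, incompatible_pairs):
    """Greedily color the incompatibility graph so that no bin holds
    two incompatible genes, while trying to keep bin sizes balanced.

    genes: list of gene identifiers (graph vertices).
    incompatible_pairs: iterable of (gene_a, gene_b) edges, assumed to be
        precomputed from the gene trees at binning threshold t.
    Returns a list of bins, each a list of gene identifiers.
    """
    neighbors = defaultdict(set)
    for a, b in incompatible_pairs:
        neighbors[a].add(b)
        neighbors[b].add(a)

    # Place the most constrained genes (highest degree) first;
    # ties are broken by gene name so the result is deterministic.
    order = sorted(genes, key=lambda g: (-len(neighbors[g]), g))

    bins = []
    for gene in order:
        # Bins that contain no gene incompatible with this one.
        allowed = [b for b in bins if not neighbors[gene].intersection(b)]
        if allowed:
            # Balance: put the gene into the smallest allowed bin.
            min(allowed, key=len).append(gene)
        else:
            bins.append([gene])
    return bins


# Toy example: g1 conflicts with g2 and g3; g4 conflicts with nothing.
example_bins = bin_genes(
    ["g1", "g2", "g3", "g4"],
    [("g1", "g2"), ("g1", "g3")],
)
print(example_bins)
```

In WSB+CAML each resulting bin would then be concatenated into a supergene alignment, whereas in WSB+WQMC the bin determines which genes contribute weighted quartets to each gene's WQMC input.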