Convergence-divergence models: Generalizations of phylogenetic trees modeling gene flow over time

Jonathan D. Mitchell, Barbara R. Holland

TL;DR

This work introduces convergence-divergence models (CDMs), which generalize phylogenetic trees by allowing gene flow over time on a single principal tree and can describe processes such as introgressive hybridization and replicated evolution. It develops maximum-likelihood, quartet-based algorithms to infer $N$-taxon CDMs (topology, convergence groups, and parameters) from datasets such as multiple sequence alignments (MSAs) and gene presence/absence data. The authors establish identifiability and consistency results for $4$-taxon CDMs and devise divide-and-conquer strategies to scale to large numbers of taxa, including procedures to infer leaf-taxon distances and edge lengths under the Hadamard-basis parametrization. The framework offers a flexible, scalable alternative to phylogenetic networks that explicitly models gradual gene flow and convergence across genomes and gene sets, with applications in comparative genomics and evolutionary studies.

Abstract

Phylogenetic trees are simple models of evolutionary processes. They describe conditionally independent divergent evolution of taxa from common ancestors. Phylogenetic trees often lack the flexibility to model all evolutionary processes adequately; one example is introgressive hybridization, where genes can flow from one taxon to another. Phylogenetic networks model evolution not fully described by a phylogenetic tree. However, many phylogenetic network models assume ancestral taxa merge instantaneously to form "hybrid" descendant taxa. In contrast, our convergence-divergence models retain a single underlying "principal" tree, but permit gene flow over arbitrary time frames. Alternatively, convergence-divergence models can describe other biological processes leading to taxa becoming more similar over a time frame, such as replicated evolution. Here we present novel maximum likelihood-based algorithms to infer most aspects of $N$-taxon convergence-divergence models, many consistently, using a quartet-based approach. The algorithms can be applied to multiple sequence alignments restricted to genes or genomic windows or to gene presence/absence datasets.
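The quartet-based approach can be pictured as a loop over 4-taxon subsets of the leaf set. The following is a minimal sketch of that enumeration only, assuming (as the Figure 3 caption states) that the quartets considered are those containing the outgroup. The function name and taxon labels are hypothetical, and the per-quartet maximum likelihood fit is not shown.

```python
from itertools import combinations
from math import comb


def quartets_with_outgroup(taxa, outgroup):
    """Yield every 4-taxon subset of `taxa` that contains `outgroup`.

    Hypothetical helper for illustration; not the authors' code.
    """
    if outgroup not in taxa:
        raise ValueError("outgroup must be one of the taxa")
    # Sort the ingroup so the enumeration order is deterministic.
    ingroup = sorted(t for t in taxa if t != outgroup)
    for triple in combinations(ingroup, 3):
        yield (outgroup,) + triple


# Hypothetical labels: one outgroup plus four ingroup taxa.
taxa = ["out", "a", "b", "c", "d"]
quartets = list(quartets_with_outgroup(taxa, "out"))

# With N taxa there are C(N-1, 3) quartets that include the outgroup.
assert len(quartets) == comb(len(taxa) - 1, 3)
```

Fixing the outgroup reduces the number of quartets from $\binom{N}{4}$ to $\binom{N-1}{3}$, which is the count a divide-and-conquer procedure of this shape would need to fit.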

Paper Structure

This paper contains 35 sections, 34 theorems, 124 equations, 9 figures, 5 algorithms.

Key Result

Proposition 1

Suppose a tip epoch of CDM $\mathcal{N}$ with leaf taxon set $X$ and $\left|X\right|=N$ corresponds to a set of sets of taxa in each convergence-divergence group $\mathcal{C}=\left\{C_1,C_2,\ldots,C_k\right\}$. Suppose $\boldsymbol{Q}^{\left[\mathcal{C}\right]}$ is the $2^N\times2^N$ rate matrix [...] where $\alpha_r,\beta_r>0$.
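The dimension $2^N$ is what one would expect if each of the $N$ leaf taxa carries a two-state character, so that the joint state space is the set of binary patterns across the leaves. That reading is an assumption here, suggested by the gene presence/absence data mentioned in the abstract. The sketch below enumerates those patterns and builds the standard Sylvester-type Hadamard matrix of the same size. It illustrates the sizes involved and the kind of change of basis behind a Hadamard-basis parametrization, and it is not necessarily the authors' exact convention.

```python
from itertools import product


def binary_patterns(n):
    """All 2**n binary state patterns across n leaf taxa (assumed two-state)."""
    return list(product((0, 1), repeat=n))


def hadamard(n):
    """Sylvester construction: the n-fold Kronecker power of [[1, 1], [1, -1]]."""
    h = [[1]]
    for _ in range(n):
        top = [row + row for row in h]
        bottom = [row + [-x for x in row] for row in h]
        h = top + bottom
    return h


def matmul(a, b):
    """Plain-Python product of two square integer matrices of equal size."""
    m = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(m)) for j in range(m)]
            for i in range(m)]


n = 3  # hypothetical number of leaf taxa
patterns = binary_patterns(n)
H = hadamard(n)

# The state space and the Hadamard matrix both have dimension 2**n.
assert len(patterns) == len(H) == 2 ** n

# H is symmetric and H @ H = 2**n * I, so its inverse is H / 2**n.
HH = matmul(H, H)
identity_scaled = [[2 ** n if i == j else 0 for j in range(2 ** n)]
                   for i in range(2 ** n)]
assert HH == identity_scaled
```

Because the inverse is just a rescaled copy of the matrix itself, moving between the pattern basis and the Hadamard basis needs no separate matrix inversion.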

Figures (9)

  • Figure 1: Two representations (a, b) of a phylogenetic tree with equivalent probability distributions at the leaves. $\delta$ is the splitting operator representing speciation events. (a) Splitting operators have not been pushed back. (b) Splitting operators have been pushed back. Parallel edges separated by small gaps are identical edges. (c) The rate matrix that keeps two identical edges identical models convergence between two diverged edges, represented by the two curved edges
  • Figure 2: The five $4$-taxon CDMs meeting assumptions of Section \ref{ass} before considering leaf labeling and parameter values. Convergence is drawn as curves. Epochs are separated by events represented by dashed lines on CDM $5$. For each epoch the corresponding partition or decorated partition is on the left. Epoch intervals are on the right. Parameters are labeled on sections of the edges of CDM $5$
  • Figure 3: The process of inferring an $N$-taxon CDM from an empirical dataset. All $4$-taxon trees and CDMs that include the outgroup are considered
  • Figure 4: $4$-taxon CDMs $\mathcal{N}_1$, $\mathcal{N}_2$ and $\mathcal{N}_3$, with identical sets of possible phylogenetic tensors in the limit that epoch lengths labeled $0$ and $\infty$ converge or diverge to $0$ or $\infty$
  • Figure 5: Distinct $5$-taxon CDMs $\mathcal{N}_1$ and $\mathcal{N}_2$ with the same topology of the principal tree and matrix of proportions of converging quartets
  • ...and 4 more figures

Theorems & Definitions (56)

  • Definition 1
  • Definition 2
  • Definition 3
  • Definition 4
  • Proposition 1
  • Theorem 2
  • Proposition 3
  • Definition 5
  • Proposition 4
  • proof
  • ...and 46 more