Convergence-divergence models: Generalizations of phylogenetic trees modeling gene flow over time
Jonathan D. Mitchell, Barbara R. Holland
TL;DR
This work introduces convergence-divergence models (CDMs), which generalize phylogenetic trees by allowing gene flow over time on a single principal tree, and so can describe processes such as introgressive hybridization and replicated evolution. It develops maximum-likelihood, quartet-based algorithms to infer N-taxon CDMs (topology, convergence groups, and parameters) from datasets such as multiple sequence alignments (MSAs) and gene presence/absence data. The authors establish identifiability and consistency results for 4-taxon CDMs and devise divide-and-conquer strategies to scale to large numbers of taxa, including procedures to infer leaf-taxon distances and edge lengths under the Hadamard-basis parametrization. The framework offers a flexible, scalable alternative to phylogenetic networks that explicitly models gradual gene flow and convergence across entire genomes and gene sets, with applications in comparative genomics and evolutionary studies.
Abstract
Phylogenetic trees are simple models of evolutionary processes. They describe conditionally independent divergent evolution of taxa from common ancestors. However, phylogenetic trees often lack the flexibility to model all evolutionary processes adequately. One example is introgressive hybridization, in which genes flow from one taxon to another. Phylogenetic networks model evolution that is not fully described by a phylogenetic tree. However, many phylogenetic network models assume that ancestral taxa merge instantaneously to form ``hybrid'' descendant taxa. In contrast, our convergence-divergence models retain a single underlying ``principal'' tree, but permit gene flow over arbitrary time frames. Convergence-divergence models can also describe other biological processes that make taxa more similar over a time frame, such as replicated evolution. Here we present novel maximum likelihood-based algorithms to infer most aspects of $N$-taxon convergence-divergence models, many consistently, using a quartet-based approach. The algorithms can be applied to multiple sequence alignments restricted to genes or genomic windows, or to gene presence/absence datasets.
