Bounds on the sequence length sufficient to reconstruct level-1 phylogenetic networks
Martin Frohn, Niels Holtgrefe, Leo van Iersel, Mark Jones, Steven Kelk
TL;DR
This work establishes data-sufficiency bounds for reconstructing binary semi-directed level-1 phylogenetic networks under the Cavender-Farris-Neyman (CFN) model. It develops a distance-based reconstruction framework built on quartet-profile inference rules and a dyadic-closure approach, introducing the DCNC and L1QPC components. It proves that, for high-probability recovery, it suffices for the sequence length to scale logarithmically, polynomially, or polylogarithmically with the number of taxa, with the regime determined by the network's depth, cycle length, and mutation-hierarchy parameters. By proving a linear-size encoding via representative quartet profiles and providing polynomial-time reconstruction, the study extends convergence results from tree inference to network inference, with corollaries showing that linear-sized quartet/quarnet sets suffice for full reconstruction. The findings offer practical guidance on how much genomic data is needed in network-aware evolutionary analyses and pave the way for extending these methods to more general network classes and substitution models.
Abstract
Phylogenetic trees and networks are graphs used to model evolutionary relationships, with trees representing strictly branching histories and networks allowing for events in which lineages merge, called reticulation events. While the question of data sufficiency has been studied extensively in the context of trees, it remains largely unexplored for networks. In this work we take a first step in this direction by establishing bounds on the amount of genomic data required to reconstruct binary level-$1$ semi-directed phylogenetic networks, which are binary networks in which reticulation events are indicated by directed edges, all other edges are undirected, and cycles are vertex-disjoint. For this class, statistically consistent methods have recently been developed. Roughly speaking, such methods are guaranteed to reconstruct the correct network assuming infinitely long genomic sequences. Here we consider the question of whether networks from this class can be uniquely and correctly reconstructed from finite sequences. Specifically, we present an inference algorithm that takes genetic sequence data as input and demonstrate that the sequence length sufficient to reconstruct the correct network with high probability, under the Cavender-Farris-Neyman model of evolution, scales logarithmically, polynomially, or polylogarithmically with the number of taxa, depending on the parameter regime. As part of our contribution, we also present novel inference rules for quartet data in the semi-directed phylogenetic network setting.
