Simplifying and Characterizing DAGs and Phylogenetic Networks via Least Common Ancestor Constraints

Anna Lindeberg, Marc Hellmuth

TL;DR

This work develops a rigorous framework to simplify directed acyclic graphs modeling evolutionary histories by retaining only vertices that are supported as least common ancestors. Central to the approach is the simple $\ominus$-operator, which collapses non-LCA vertices while preserving ancestral relationships and clustering structure, yielding $\mathrm{LCA}$-Rel and $\mathrm{lca}$-Rel DAGs. The authors provide comprehensive characterizations, linear-time LCA computation for small sets, and polynomial-time transformations into $\mathrm{LCA}$-relevant and $\mathrm{lca}$-relevant DAGs, with strong ties to regular DAGs and the (PCC) and (CL) properties. They also map the computational complexity landscape, showing NP-hardness in general but tractability under the (N3O) cluster constraint, which encompasses important classes like rooted trees and galled trees. The results culminate in a practical, verifiable framework (with the SimpliDAG tool) for producing phylogenetically interpretable networks that preserve key data-supported structure while enabling scalable analysis and comparison across models.

Abstract

Rooted phylogenetic networks, or more generally, directed acyclic graphs (DAGs), are widely used to model species or gene relationships that traditional rooted trees cannot fully capture, especially in the presence of reticulate processes or horizontal gene transfers. Such networks or DAGs are typically inferred from observable data (e.g. genomic sequences of extant species), providing only an estimate of the true evolutionary history. However, these inferred DAGs are often complex and difficult to interpret. In particular, many contain vertices that do not serve as least common ancestors (LCAs) for any subset of the underlying genes or species, and thus may lack direct support from the observable data. In contrast, LCA vertices are witnessed by historical traces justifying their existence and thus represent ancestral states substantiated by the data. To reduce unnecessary complexity and eliminate unsupported vertices, we aim to simplify a DAG to retain only LCA vertices while preserving essential evolutionary information. In this paper, we characterize $\mathrm{LCA}$-relevant and $\mathrm{lca}$-relevant DAGs, defined as those in which every vertex serves as an LCA (or unique LCA) for some subset of taxa. We introduce methods to identify LCAs in DAGs and efficiently transform any DAG into an $\mathrm{LCA}$-relevant or $\mathrm{lca}$-relevant one while preserving key structural properties of the original DAG or network. This transformation is achieved using a simple operator "$\ominus$" that mimics vertex suppression.
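
To make the two notions in the abstract concrete, the following minimal Python sketch checks by brute force which vertices of a small DAG are LCAs, or unique LCAs, of some non-empty leaf subset. It is an illustration only: the toy DAG and the function names `clusters`, `lca_sets` and `relevance` are hypothetical, the enumeration is exponential in the number of leaves, and it is not the algorithm from the paper. It assumes the usual definition that $v$ is an LCA of $A$ if $A$ is contained in the cluster of $v$ but in the cluster of no child of $v$.

```python
from itertools import combinations

# Hypothetical toy DAG (not a figure from the paper); edges go parent -> child.
children = {
    "r": ["a", "z"],
    "a": ["v", "w"],
    "v": ["x", "y"],
    "w": ["x", "y"],
    "x": [], "y": [], "z": [],
}

def clusters(children):
    """Map each vertex v to its cluster: the set of leaves below v."""
    memo = {}
    def C(v):
        if v not in memo:
            kids = children[v]
            memo[v] = frozenset([v]) if not kids else frozenset().union(*(C(c) for c in kids))
        return memo[v]
    return {v: C(v) for v in children}

def lca_sets(children):
    """Map each non-empty leaf subset A to the set of its least common ancestors."""
    C = clusters(children)
    leaves = sorted(v for v in children if not children[v])
    result = {}
    for k in range(1, len(leaves) + 1):
        for A in map(frozenset, combinations(leaves, k)):
            # v is an LCA of A if A fits in C(v) but in no child's cluster.
            result[A] = {
                v for v in children
                if A <= C[v] and not any(A <= C[c] for c in children[v])
            }
    return result

def relevance(children):
    """Return (vertices that are no LCA, vertices that are no unique LCA)."""
    lcas = lca_sets(children)
    is_LCA = set().union(*lcas.values())
    is_lca = {next(iter(s)) for s in lcas.values() if len(s) == 1}
    return set(children) - is_LCA, set(children) - is_lca

not_LCA, not_lca = relevance(children)
print(sorted(not_LCA), sorted(not_lca))
```

In this toy DAG the vertex `a` is not an LCA of any leaf subset, so the DAG is not $\mathrm{LCA}$-relevant. The vertices `v` and `w` are both LCAs of `{x, y}` but neither is a unique LCA of any subset, so even after dealing with `a` the DAG would not be $\mathrm{lca}$-relevant.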

Paper Structure

This paper contains 13 sections, 39 theorems, 13 equations, 11 figures, 1 table, 3 algorithms.

Key Result

Lemma 2.1

For all DAGs $G$ and all $u,v\in V(G)$ it holds that $u\preceq_G v$ implies $\mathop{\mathrm{\mathtt{C}}}\nolimits_G(u)\subseteq\mathop{\mathrm{\mathtt{C}}}\nolimits_G(v)$.
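
Lemma 2.1 can be checked mechanically on small examples. The sketch below is a hedged illustration, assuming that $u \preceq_G v$ means $u$ is reachable from $v$ along directed edges (including $u = v$) and that $\mathtt{C}_G(v)$ is the set of leaves reachable from $v$. The toy DAG and the helper names `descendants`, `cluster` and `lemma_2_1_holds` are hypothetical and not taken from the paper.

```python
from itertools import product

# Hypothetical toy DAG (not a figure from the paper); edges go parent -> child.
children = {
    "r": ["u", "w"],
    "u": ["v", "z"],
    "v": ["x", "y"],
    "w": ["y", "z"],
    "x": [], "y": [], "z": [],
}

def descendants(children, v):
    """All vertices reachable from v, including v itself."""
    seen, stack = {v}, [v]
    while stack:
        for c in children[stack.pop()]:
            if c not in seen:
                seen.add(c)
                stack.append(c)
    return seen

def cluster(children, v):
    """The leaves (out-degree 0) among the descendants of v."""
    return frozenset(u for u in descendants(children, v) if not children[u])

def lemma_2_1_holds(children):
    """Check for every pair (u, v): u below v implies C(u) is a subset of C(v)."""
    for u, v in product(children, repeat=2):
        if u in descendants(children, v):
            if not cluster(children, u) <= cluster(children, v):
                return False
    return True

print(sorted(cluster(children, "w")), lemma_2_1_holds(children))
```

The converse does not hold in general: distinct, incomparable vertices may have nested or even identical clusters, which is why the paper distinguishes properties such as (PCC) and regularity.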

Figures (11)

  • Figure 1: Shown are three networks $N$, $N'$ and $T$. All have the same clustering system $\mathfrak{C} = \{\{x\},\{y\},\{z\},\{x,y\},\{x,y,z\}\}$ and leaf set $X = \{x,y,z\}$. Here, only $N'$ and $T$ are phylogenetic. The network $N$ is not phylogenetic, since $N$ contains the vertex $u'$ with in- and out-degree one. Moreover, vertices $u$, $u'$ and $u''$ in $N$ are not LCAs of any subset of leaves. "Removing" $u$, $u'$ and $u''$ from $N$ via the "$\ominus$"-operator -- as explained in detail in Section \ref{sec:ominus-lcaRel} -- yields the simplified network $N' = N\ominus \{u,u',u''\}$ in which all vertices are LCAs of some subset of $X$. Hence, $N'$ is $\operatorname{LCA}$-Rel. In particular, $N'$ is precisely the simplification $\varphi_{\operatorname{LCA}}(N)$ as explained in Section \ref{sec:simplify}. However, $N'$ is not $\operatorname{lca}$-Rel as the vertices $v$ and $w$ are not unique LCAs in $N'$ for any subset of $X$. If desired, $N'$ can now be further simplified by "removing" one of $v$ or $w$ and resulting shortcuts which yields the phylogenetic and $\operatorname{lca}$-Rel tree $T$. The tree $T$ is the unique phylogenetic tree whose clustering system $\mathfrak{C}$ is identical to those in $N$ and $N'$.
  • Figure 2: Shown are two phylogenetic networks $N_1$ and $N_2$ and a phylogenetic DAG $G$ such that $\mathfrak{C}_{N_1} = \mathfrak{C}_{N_2} = \mathfrak{C}_G$. The clusters $\mathop{\mathrm{\mathtt{C}}}\nolimits(v)$ are drawn next to each individual vertex $v$ and highlighted by blue text. Out of the shown DAGs, only $N_1$ satisfies (PCC), is regular and has the strong-(CL) property (i.e., $v = \operatorname{lca}_{N_1}(\mathop{\mathrm{\mathtt{C}}}\nolimits_{N_1}(v))$ for all $v$ in $N_1$; cf. Def. \ref{def:CLstrongCL}). Moreover, only $N_1$ is $\operatorname{lca}$-Rel and $\operatorname{LCA}$-Rel (cf. Def. \ref{def:LCAlcarel}). Here, $G$ is $\operatorname{LCA}$-Rel but $N_2$ is not.
  • Figure 3: Shown are four phylogenetic networks $N_1$, $N_2$, $N_3$ and $N_4$ with the same set of leaves. Here, $N_1$ and $N_2 =N_1\ominus u$ are regular networks. The networks $N_3$ and $N_4$ only differ from $N_1$ by one edge each, as highlighted by dashed lines. Each inner vertex $v$ of these networks with $\mathop{\mathrm{\mathtt{C}}}\nolimits_{N_i}(v)\neq \{a,b,c\}$ is a $2$-$\operatorname{lca}$ vertex. In $N_1$, the vertex $u$ with cluster $\mathop{\mathrm{\mathtt{C}}}\nolimits_{N_1}(u)=\{a,b,c\}$ is not a $2$-$\operatorname{lca}$ vertex, but a $3$-$\operatorname{lca}$ vertex. Consequently, $N_1$ is a $\{1,2,3\}$-$\operatorname{lca}$-Rel network but not $\{1,2\}$-$\operatorname{lca}$-Rel. One may also verify that the same holds for the network $N_3$ but that $N_2$ is $\{1,2\}$-$\operatorname{lca}$-Rel. For $N_4$ we can apply Lemma \ref{lem:not_kLCA} to the edge $(u,u')$ connecting the vertices $u$ and $u'$ for which $\mathop{\mathrm{\mathtt{C}}}\nolimits_{N_4}(u)=\{a,b,c\}=\mathop{\mathrm{\mathtt{C}}}\nolimits_{N_4}(u')$ holds and conclude that $N_4$ is not $\operatorname{LCA}$-Rel and, therefore, not $\operatorname{lca}$-Rel. In particular, the vertex $u$ in $N_4$ is not the LCA of any subset of leaves.
  • Figure 4: Shown are three networks $N_1$, $N_2$ and $N_3$ having the same clustering system $\mathfrak{C} = \{\{x\}, \{y\}, \{x,y\}\}$. The network $N_1$ has the (CL) but not the strong-(CL) property. The network $N_2$ has the strong-(CL) and, thus, also the (CL) property. The network $N_3$ has neither the strong-(CL) nor the (CL) property.
  • Figure 5: The network $G$ is neither $\operatorname{lca}$-Rel nor $\operatorname{LCA}$-Rel, since none of the vertices $v$, $w$ and $\rho$ in $G$ are $\{1,2\}$-$\operatorname{lca}$ vertices. Since $\rho$ is the only vertex that is not a $\{1,2\}$-$\operatorname{LCA}$ vertex, $G\ominus \rho$ is $\operatorname{LCA}$-Rel. The set $W=\{\rho,v,w\}$ is the set of all vertices that are not $\{1,2\}$-$\operatorname{lca}$ vertices. The stepwise computation of $G\ominus v$, $(G\ominus v)\ominus \rho$ and $G\ominus W$ is shown in the lower part and results in a disconnected DAG. However, Algorithm \ref{alg:lca-relevant} determines whether a vertex is a $\{1,2\}$-$\operatorname{lca}$ vertex in the updated DAG. Hence, if we start with $v$ to obtain $G\ominus v$, there is only one vertex left that is not a $\{1,2\}$-$\operatorname{lca}$ vertex, namely $\rho$. In $(G\ominus v)\ominus \rho$ each vertex is a $\{1,2\}$-$\operatorname{lca}$ vertex and the algorithm terminates.
  • ...and 6 more figures
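
The captions above repeatedly apply the $\ominus$-operator. As a rough sketch, and assuming that $G \ominus v$ is obtained by deleting $v$ with its incident edges and then joining every parent of $v$ to every child of $v$ (the formal definition is in the paper's section on $\ominus$), the operation can be written in a few lines of Python. The toy DAG and the names `ominus`, `cluster_system` and `roots` are hypothetical and do not reproduce any figure.

```python
# Hypothetical toy DAG (not a figure from the paper); edges go parent -> child.
G = {
    "r": ["a", "z"],
    "a": ["v", "w"],
    "v": ["x", "y"],
    "w": ["x", "y"],
    "x": [], "y": [], "z": [],
}

def ominus(children, v):
    """Return a new DAG without v; each parent of v is joined to each child of v."""
    kids = children[v]
    out = {}
    for u, cs in children.items():
        if u == v:
            continue
        new = [c for c in cs if c != v]
        if v in cs:
            # u was a parent of v: inherit v's children, avoiding duplicate edges.
            new += [c for c in kids if c not in new]
        out[u] = new
    return out

def cluster_system(children):
    """The set of all clusters, i.e. leaf sets below each vertex."""
    def C(u):
        kids = children[u]
        return frozenset([u]) if not kids else frozenset().union(*map(C, kids))
    return {C(u) for u in children}

def roots(children):
    """Vertices without any parent, in sorted order."""
    has_parent = {c for cs in children.values() for c in cs}
    return sorted(u for u in children if u not in has_parent)

H = ominus(G, "a")   # suppress the inner vertex a
K = ominus(G, "r")   # removing the root leaves its children without a parent
print(H["r"], cluster_system(H) == cluster_system(G), roots(K))
```

In this example, removing `a` leaves the clustering system unchanged, while removing the root `r` yields two parentless vertices. The latter matches the observation in the Figure 5 caption that removing a root can produce a disconnected DAG.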

Theorems & Definitions (77)

  • Lemma 2.1: S+24
  • Lemma 2.2
  • proof
  • Definition 2.3
  • Definition 2.4
  • Lemma 2.5
  • proof
  • Definition 2.6: Baroni:05
  • Lemma 3.1
  • proof
  • ...and 67 more