Orthology and Near-Cographs in the Context of Phylogenetic Networks

Anna Lindeberg, Guillaume E. Scholz, Nicolas Wieseke, Marc Hellmuth

TL;DR

This paper addresses whether orthology graphs inferred without explicit gene or species trees can be explained by phylogenetic networks, focusing on the restrictive yet informative class of level-$1$ networks. Using modular decomposition and the $2$-$lca$ property, it characterizes level-$1$ explainable graphs as those whose every primitive subgraph is a near-cograph, and provides a linear-time algorithm to recognize such graphs and construct a $0/1$-labeled level-$1$ network that explains them. It then develops the prime-vertex replacement framework to systematically build level-$1$ networks from primitive subgraphs, establishing several equivalences and hereditary properties, and proves that Lev-$1$-Ex graphs are perfect and have twin-width at most $2$. The work lays a foundation for scalable analysis of network-based orthology, suggests generalizations to higher-level networks, and connects graph-theoretic concepts to practical evolutionary modeling, enabling efficient testing and construction of network explanations for biological data.

Abstract

Orthologous genes, which arise through speciation, play a key role in comparative genomics and functional inference. In particular, graph-based methods allow for the inference of orthology estimates without prior knowledge of the underlying gene or species trees. This results in orthology graphs, where each vertex represents a gene, and an edge exists between two vertices if the corresponding genes are estimated to be orthologs. Orthology graphs inferred under a tree-like evolutionary model must be cographs. However, real-world data often deviate from this property, either due to noise in the data, errors in inference methods or, simply, because evolution follows a network-like rather than a tree-like process. The latter, in particular, raises the question of whether and how orthology graphs can be derived from or, equivalently, are explained by phylogenetic networks. Here, we study the constraints imposed on orthology graphs when the underlying evolutionary history follows a phylogenetic network instead of a tree. We show that any orthology graph can be represented by a sufficiently complex level-k network. However, such networks lack biologically meaningful constraints. In contrast, level-1 networks provide a simpler explanation, and we establish characterizations for level-1 explainable orthology graphs, i.e., those derived from level-1 evolutionary histories. To this end, we employ modular decomposition, a classical technique for studying graph structures. Specifically, an arbitrary graph is level-1 explainable if and only if each primitive subgraph is a near-cograph (a graph in which the removal of a single vertex results in a cograph). Additionally, we present a linear-time algorithm to recognize level-1 explainable orthology graphs and to construct a level-1 network that explains them, if such a network exists.
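
To make the near-cograph notion from the abstract concrete, the following is a minimal brute-force sketch. It relies on the standard characterization of cographs as the graphs without an induced $P_4$, and it tests the near-cograph condition exactly as phrased above (some single vertex whose removal leaves a cograph). The function names and the adjacency-dictionary representation are assumptions made for illustration only; this is not the paper's linear-time algorithm, and it does not test level-1 explainability, which additionally requires the modular decomposition.

```python
from itertools import combinations, permutations


def find_induced_p4(adj, vertices):
    """Return an induced path a-b-c-d on `vertices`, or None (brute force)."""
    for quad in combinations(sorted(vertices), 4):
        for a, b, c, d in permutations(quad):
            path_edges = b in adj[a] and c in adj[b] and d in adj[c]
            no_chords = (c not in adj[a] and d not in adj[a]
                         and d not in adj[b])
            if path_edges and no_chords:
                return (a, b, c, d)
    return None


def is_cograph(adj, vertices=None):
    """A graph is a cograph iff it contains no induced P4."""
    vs = set(adj) if vertices is None else set(vertices)
    return find_induced_p4(adj, vs) is None


def is_near_cograph(adj):
    """True if deleting some single vertex leaves a cograph."""
    vs = set(adj)
    return any(is_cograph(adj, vs - {v}) for v in vs)


def graph(edges):
    """Build an undirected adjacency dictionary from an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


# Hypothetical example graphs.
p4_edges = [("a", "b"), ("b", "c"), ("c", "d")]
p4 = graph(p4_edges)                      # a single induced P4
c5 = graph([("1", "2"), ("2", "3"), ("3", "4"), ("4", "5"), ("5", "1")])
two_p4 = graph(p4_edges + [("w", "x"), ("x", "y"), ("y", "z")])

print(is_cograph(p4), is_near_cograph(p4))
print(is_near_cograph(c5), is_near_cograph(two_p4))
```

The brute-force search runs in roughly $O(n^5)$ time and is meant only for small examples. The last two graphs illustrate why a single vertex deletion can be insufficient: every vertex deletion in a $C_5$ leaves a $P_4$, and two vertex-disjoint $P_4$s need at least two deletions.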

Paper Structure

This paper contains 16 sections, 38 theorems, 14 equations, 5 figures, 1 algorithm.

Key Result

Lemma 2.1

A clustering system $\mathfrak{C}$ is closed if and only if $A,B \in \mathfrak{C}$ and $A\cap B\ne\emptyset$ implies $A\cap B\in \mathfrak{C}$.
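
The intersection criterion of Lemma 2.1 can be checked directly on small set families. The sketch below is only an illustration under stated assumptions: the function name `satisfies_intersection_criterion` and the two example families are hypothetical, and the code checks just the pairwise-intersection condition from the lemma. It does not verify that the input is a clustering system in the first place.

```python
from itertools import combinations


def satisfies_intersection_criterion(clusters):
    """Check: A, B in the family with A ∩ B nonempty implies A ∩ B in the family."""
    family = {frozenset(c) for c in clusters}
    for a, b in combinations(family, 2):
        common = a & b
        if common and common not in family:
            return False
    return True


# Hypothetical family on X = {a, b, c}: every nonempty pairwise
# intersection is already a member.
closed_family = [{"a"}, {"b"}, {"c"}, {"a", "b"}, {"b", "c"}, {"a", "b", "c"}]

# Hypothetical family on X = {a, b, c, d}: {a,b,c} ∩ {b,c,d} = {b,c}
# is missing, so the criterion fails.
open_family = [{"a"}, {"b"}, {"c"}, {"d"},
               {"a", "b", "c"}, {"b", "c", "d"}, {"a", "b", "c", "d"}]

print(satisfies_intersection_criterion(closed_family))
print(satisfies_intersection_criterion(open_family))
```

Using `frozenset` makes the clusters hashable, so membership of an intersection in the family is a constant-time lookup.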

Figures (5)

  • Figure 1: Shown are four $0/1$-labeled DAGs $(N,t)$, $(N',t')$, $(N'',t'')$ and $(\widetilde{N}, \tilde{t})$ that all explain the graph $G$. Here, $N$, $N'$ and $N''$ are networks while $\widetilde{N}$ is not. Since $G\simeq P_4$, Theorem \ref{['thm:CharCograph']} implies that $G$ cannot be explained by a $0/1$-labeled tree. The network $(N,t)$ is a "half-grid" (cf. BSH:22) and is level-$3$, whereas $(N',t')$ is a regular level-2 network with $N'\doteq\mathscr{H}(\mathfrak{C}_G)$ where $\mathfrak{C}_G$ is chosen according to Equation \ref{['eq:C_G']}. The network $(N'',t'')$ is a level-1 network and coincides with $(\mathscr{T}_{\scaleto{\nwarrow}{4pt} c},\tau_{\scaleto{\nwarrow}{4pt} c})$, which is obtained from the cotree $(\mathscr{T},\tau)$ (shown in the upper right corner) of the cograph $G-c$ (cf. Definition \ref{['def:graft']} and Theorem \ref{['thm:N<-v_explains-G']}).
  • Figure 2: A $0/1$-labeled level-1 network $(N,t)$ where $G=\mathscr{G}(N,t)$ consists of $k>1$ vertex disjoint induced $P_4$s. If $G-Y$ is a cograph, then $Y\subseteq V(G)$ with $|Y|\geq k>1$ must hold.
  • Figure 3: Example for Definition \ref{['def:graft']}, constructing the shown $0/1$-labeled network $(N_{\scaleto{\nwarrow}{4pt} v},t_{\scaleto{\nwarrow}{4pt} v})$ from the level-1 network $(N,t)$ as shown in Figure \ref{['fig:level2counterex']}. The graph $G-v = \mathscr{G}(N,t)$ is explained by the level-1 network $(N,t)$; however, the network $(N_{\scaleto{\nwarrow}{4pt} v},t_{\scaleto{\nwarrow}{4pt} v})$ is level-$(k+1)$, $k>1$. Note that $G$ contains $k$ induced cycles $C_5$ on five vertices and is, by Lemma \ref{['lem:odd-hole-free']}, not Lev-1-Ex.
  • Figure 4: Shown is a graph $G$ that is explained by several $0/1$-labeled level-1 networks: $(N,t)$, $(N',t')$, and $(N'',t'')$. Note that $G$ is a near-cograph as $G-d$ is a cograph. Here, $(N,t)$ is a pvr-network obtained from the MDT $(\mathscr{T}_G,\tau_G)$ of $G$ by replacing the $\mathrm{prime} (\mathtt P)$-labeled vertex by a $0/1$-labeled level-1 network according to Definition \ref{['def:pvr']}. Moreover, according to Definition \ref{['def:graft']} and Equation \ref{['eq:mapN-NWA-z']}, $(N',t') = (\tilde{T}_{\scaleto{\nwarrow}{4pt} d}, \tilde{t}_{\scaleto{\nwarrow}{4pt} d})$ with $(\tilde{T}, \tilde{t})$ being the cotree of $G-d$. In contrast, the network $(N'',t'')$ is not obtained from the MDT of $G$ resp. cotree of $G-z$ in the same manner as the other two networks. Since the cluster $\{y,z\} \in \mathfrak{C}_{N''}$ overlaps no cluster in $\mathfrak{C}_{N''}$, Corollary \ref{['cor:nonoverlap-module']} implies that $\{y,z\}$ is a module of $G$. The same is true for the set $\{x,y,z\}$, which is a (strong) module obtained as the union of clusters $\{x\} \cup \{y,z\}$, which does not overlap with any cluster in $\mathfrak{C}_{N''}$. According to Lemma \ref{['lem:block-module']}, another module of $G$ is given by $\mathop{\mathrm{L}}\nolimits(B) = \{a,b,c,d\} \subsetneq \mathop{\mathrm{\mathtt{C}}}\nolimits_{N''}(\max_B)$ for the non-trivial block $B$ of $N''$ (highlighted in gray-shaded area). Note that $\mathop{\mathrm{\mathtt{C}}}\nolimits_{N''}(\max_B) = \{a,b,c,d,h\} \in \mathfrak{C}_{N''}$ is not a module of $G$ as $h$ is adjacent to $x$ but none of the vertices $a,b,c,d$ is adjacent to $x$.
  • Figure 5: A primitive graph $G=\mathscr{G}(N,t)$ that is explained by the $0/1$-labeled level-2 network $(N,t)$. Here, there is no subset $Y\subseteq V(G)$ of size $|Y|\leq 2$ such that $G-Y$ is a cograph. To see this, assume, for contradiction, that there is a set $Y$ of size $|Y|\leq 2$ resulting in the cograph $G-Y$. First observe that there are two intertwined induced $P_4$s $G[\{b,a,f,j\}]$ and $G[\{a,f,j,i\}]$ that are both vertex-disjoint with the induced $P_4$ $G[\{c,g,d,e\}]$. Hence, $Y$ must contain one vertex of $a,f,j$ and one vertex of $c,g,d,e$ to destroy these $P_4$s. If $j\in Y$, then $Y = \{j,v\}$ with $v\in \{c,g,d,e\}$ would leave the induced $P_4$ $G[\{b,a,f,h\}]$ in $G-Y$. Hence, $j\notin Y$. If $f\in Y$ and $Y = \{f,e\}$, then the induced $P_4$ $G[\{b,a,g,j\}]$ is in $G-Y$. Thus, $Y\neq \{f,e\}$ and, if $f\in Y$, then $Y = \{f,v\}$ with $v\in \{c,g,d\}$ must hold, in which case $G-Y$ contains the induced $P_4$ $G[\{a,e,j,i\}]$. Thus, $f\in Y$ is not possible. Thus, $a\in Y$ must hold. If $Y=\{a,g\}$ or $Y=\{a,c\}$, the induced $P_4$ $G[\{d,e,j,i\}]$ remains in $G-Y$. If $Y=\{a,d\}$ or $Y=\{a,e\}$, the induced $P_4$ $G[\{c,f,j,i\}]$ remains in $G-Y$. Thus, none of the combinations $Y=\{v,w\}$ with $v\in \{a,f,j\}$ and $w\in \{c,g,d,e\}$ yields a cograph. Thus, there is no set $Y$ of size $|Y|\leq 2$ resulting in the cograph $G-Y$.

Theorems & Definitions (70)

  • Lemma 2.1
  • Remark
  • Definition 2.2
  • Lemma 2.3: Shanavas2024
  • Definition 2.4
  • Lemma 2.5: Hellmuth2023
  • Definition 2.6
  • Lemma 2.7
  • Lemma 2.8: Hellmuth2023
  • Lemma 2.9
  • ...and 60 more