Ultrabubble enumeration via a lowest common ancestor approach

Athanasios E. Zisis; Pål Sætrom

TL;DR

It is shown that any bidirected graph can be transformed to a bipartite biedged graph in which lowest common ancestor queries can determine whether a snarl is an ultrabubble, leading to an O(Kn) algorithm for finding all ultrabubbles in a set of K snarls, improving on the prior naive approach.

Abstract

Pangenomics uses graph-based models to represent and study the genetic variation between individuals of the same species or between different species. In such variation graphs, a path through the graph represents one individual genome. Subgraphs that encode locally distinct paths are therefore genomic regions with distinct genetic variation, and detecting such subgraphs is integral to studying genetic variation. Biedged graphs are a type of variation graph that uses two types of edges, black and grey, to represent genomic sequences and adjacencies between sequences, respectively. Ultrabubbles in biedged graphs are minimal subgraphs that represent a finite set of sequence variants that all start and end with two distinct sequences; that is, ultrabubbles are acyclic and all paths in an ultrabubble enter and exit through two distinct black edges. Ultrabubbles are therefore a special case of snarls, which are minimal subgraphs that are connected with two black edges to the rest of the graph. Here, we show that any bidirected graph can be transformed to a bipartite biedged graph in which lowest common ancestor queries can determine whether a snarl is an ultrabubble. This leads to an O(Kn) algorithm for finding all ultrabubbles in a set of K snarls, improving on the prior naive approach of O(K(n + m)) in a biedged graph with n nodes and m edges. Accordingly, our benchmark experiments on real and synthetic variation graphs show improved run times on graphs with few cycles and dead-end paths, and on dense graphs with many edges.
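The lowest common ancestor test mentioned in the abstract can be illustrated with a minimal Python sketch. It uses the containment conditions stated in the Figure 3 caption: a node $x$ lies within an $R-L$ snarl when $LCA(x, sn_1) = sn_1$ and $LCA(x, sn_2) \neq sn_2$. The function names, the toy graph, and the idea of rejecting a snarl as soon as one tip or cycle closing node tests inside are assumptions made for illustration, not the authors' implementation. The naive parent-walking LCA is used only for brevity; a constant-time LCA structure would be needed to reach the stated bound.

```python
from collections import deque

def bfs_tree(adj, root):
    """Return parent and depth maps of a BFS tree rooted at `root`."""
    parent = {root: None}
    depth = {root: 0}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, []):
            if v not in parent:
                parent[v] = u
                depth[v] = depth[u] + 1
                queue.append(v)
    return parent, depth

def lca(u, v, parent, depth):
    """Naive LCA: lift the deeper node, then walk both nodes up together."""
    while depth[u] > depth[v]:
        u = parent[u]
    while depth[v] > depth[u]:
        v = parent[v]
    while u != v:
        u, v = parent[u], parent[v]
    return u

def inside_snarl(x, sn1, sn2, parent, depth):
    """Figure 3 conditions: x descends from sn1 but not from sn2."""
    return (lca(x, sn1, parent, depth) == sn1
            and lca(x, sn2, parent, depth) != sn2)

def passes_lca_check(sn1, sn2, suspects, parent, depth):
    """Assumed rule: reject the snarl if any tip or cycle closing node
    (the `suspects`) tests inside it."""
    return not any(inside_snarl(x, sn1, sn2, parent, depth) for x in suspects)

# Hypothetical toy graph: tips "tl", "t", "tr" sit to the left of,
# within, and to the right of the snarl (sn1, sn2).
adj = {
    "root": ["a"],
    "a": ["tl", "sn1"],
    "sn1": ["b", "c"],
    "b": ["t", "sn2"],
    "c": ["sn2"],
    "sn2": ["d"],
    "d": ["tr"],
}
parent, depth = bfs_tree(adj, "root")
print(inside_snarl("t", "sn1", "sn2", parent, depth))
print(passes_lca_check("sn1", "sn2", ["tl", "t", "tr"], parent, depth))
```

Each snarl needs one pair of LCA queries per suspect node, which is where a bound linear in the number of nodes per snarl would come from under these assumptions.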

Paper Structure (18 sections, 10 theorems, 7 figures, 3 tables, 1 algorithm)

Introduction
Preliminaries
Algorithms
Ultrabubbles and snarls by the naive approach
Ultrabubbles and snarls by lowest common ancestor queries
Materials and methods
Graphs
Graph preprocessing
Identifying snarls
Identifying ultrabubbles
Software
Results
Discussion and conclusions
Supplementary Material
Building from a bidirected graph in gfa format, a biedged graph isomorphic to a directed graph
...and 3 more sections

Key Result

Corollary 1

A snarl defined by an $R$ and an $L$ node cannot have frontiers $sn_1$ and $sn_2$ at equal distances from the root, and thus we can always correctly and uniquely sort-characterize such a snarl as $R-L$ or $L-R$, depending on which of the two frontier nodes is closer to the root.
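The corollary can be read as a simple ordering rule, sketched below in Python. The function name `orient_snarl`, the node labels, and the depth values are hypothetical; the sketch assumes that a map of distances from the root is already available.

```python
def orient_snarl(r_node, l_node, depth):
    """Order the two frontier nodes of a snarl by distance from the root.

    Returns (label, sn1, sn2), where sn1 is the frontier closer to the root.
    """
    d_r, d_l = depth[r_node], depth[l_node]
    if d_r == d_l:
        # Corollary 1 states this cannot occur for a snarl with one R
        # and one L frontier, so it signals a malformed input here.
        raise ValueError("frontier nodes are equidistant from the root")
    if d_r < d_l:
        return ("R-L", r_node, l_node)
    return ("L-R", l_node, r_node)

# Hypothetical distances from the root (illustrative values only).
toy_depth = {"1_R": 1, "4_L": 6, "7_R": 9, "5_L": 8}
print(orient_snarl("1_R", "4_L", toy_depth))
print(orient_snarl("7_R", "5_L", toy_depth))
```

Because the two distances never tie, the comparison needs no tie-breaking rule, which is what makes the characterization unique.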

Figures (7)

Figure 1: Snarls that (A-B) cannot and (C-D) can form ultrabubbles in biedged graphs. Snarls are colored yellow or orange; red crosses indicate the corresponding removed black edges. (A) Any $L-L$ (left-left) or $R-R$ (right-right) snarl cannot form an ultrabubble because it belongs to the component $C_{\text{start}}$ or $C_{\text{end}}$, respectively. (B) Any $L-R$ snarl cannot form an ultrabubble because the $L$ node, which is part of the snarl, will be in the $C_{\text{start}}$ component. (C) Any $R-L$ snarl partitions the graph into three or two components. The former happens only if there are no grey edges connecting $C_{\text{start}}$ and $C_{\text{end}}$, in which case the removed black edges are bridge edges. In the latter case, at least one such grey edge exists and then $C_{\text{start}} = C_{\text{end}}$. (D) An $R-L$ snarl connected by a single grey edge is both a trivial snarl and a trivial ultrabubble.
Figure 2: Cycles that are (A-C) incompatible and (D) compatible with snarls. (A) The frontier nodes $sn_1$, $sn_2$ of an $R-L$ snarl cannot be cycle closing nodes. By definition, $sn_1$ cannot have incoming grey edges. For $sn_2$, an incoming cycle closing grey edge would violate the snarl definition, since it would place $sn_2$ and $sn_2'$ (here, $(i+7)\_L$ and $(i+7)\_R$) in the same component. (B-C) An internal node of an $R-L$ snarl cannot (B) be a cycle closing node of a cycle that lies partially inside and partially outside the snarl, or (C) have a grey edge connected to a node outside of the snarl, as this would violate the separation criterion of the snarl definition. (D) Cycle closing nodes in an $R-L$ snarl that exist between the frontier nodes $sn_1$ and $sn_2$ of the snarl represent a cycle nested in the snarl's frontiers. Note that the identified cycle closing nodes will depend on the DFS execution; here, either $(i+4)\_L$ or $(i+6)\_L$ represents the same nested cycle.
Figure 3: Lowest common ancestor (LCA) queries identify whether a node is located within an $R-L$ snarl (yellow). (A) For the three tips (green) $(t-1)\_R$, $t\_R$, and $(t+1)\_R$ located, respectively, to the left of, within, and to the right of the snarl, the following hold: $LCA((t-1)\_R, sn_1)\neq sn_1$, $LCA(t\_R, sn_1)=sn_1$, $LCA(t\_R, sn_2)\neq sn_2$, and $LCA((t+1)\_R, sn_2)= sn_2$. (B) The same set of LCA checks applies to both tips (green) and cycle closing nodes (blue).
Figure 4: Lowest common ancestor (LCA) queries in a breadth first search (BFS) tree ($LCA_{B_t}$) identify whether a node is located within an $R-L$ snarl (yellow). (A) $LCA_{B_t}$ can give a result that is closer to the root than that of the LCA in the biedged graph ($LCA_B$); for example, $LCA_{B_t}((i+3)\_L, (i+6)\_L)=(i+1)\_R$, whereas $LCA_B((i+3)\_L, (i+6)\_L)=(i+2)\_R$. (B) $LCA_{B_t}$ can never give an answer higher than $sn_1$, as such an answer would violate the separation criterion of the snarl.
Figure 5: (A) A graph without a working root, meaning that it contains cycle structures, candidate roots that cannot reach all nodes of the graph, or a combination of both. Note that both $(i+1)\_L$ and $n\_R$ are tips, but the first is a candidate root since it has in-degree 0, while the second is a dead end since it has out-degree 0. (B) In the biedged graph, the strongly connected components (SCCs) are computed and the condensation graph is built with each SCC as a supernode. The candidate roots are then the supernodes, like $1D\_L$, and the normal nodes, like $(i+1)\_L$, that have in-degree 0. (C) An artificial black edge with nodes $00\_L$, $00\_R$ is added, and $00\_R$ is connected with grey edges to every candidate root of the condensation graph, following the process explained for the respective SCC of each supernode. This makes $00\_L$ a working root for the graph. Note that this process might alter the snarl set of the graph (for example, ($i\_R$, $(i+4)\_L$) is no longer a snarl), but all ultrabubbles, like ($1\_R$, $4\_L$), and their frontier nodes remain unaffected.
...and 2 more figures
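The working-root construction described in the Figure 5 caption can be sketched in Python as follows. This is a simplified illustration on a plain directed adjacency map, not the authors' code: the function names are hypothetical, Kosaraju's algorithm is swapped in as one standard way to compute SCCs, and connecting the artificial node to an arbitrary representative of each source SCC is an assumption, since the caption does not restate the rule for choosing a node inside an SCC.

```python
def strongly_connected_components(adj):
    """Kosaraju's algorithm; returns a map node -> representative of its SCC."""
    nodes = set(adj) | {v for vs in adj.values() for v in vs}
    order, seen = [], set()
    for s in nodes:                      # first pass: record finish order
        if s in seen:
            continue
        seen.add(s)
        stack = [(s, iter(adj.get(s, [])))]
        while stack:
            u, it = stack[-1]
            for v in it:
                if v not in seen:
                    seen.add(v)
                    stack.append((v, iter(adj.get(v, []))))
                    break
            else:                        # all successors of u are explored
                order.append(u)
                stack.pop()
    radj = {u: [] for u in nodes}        # reversed graph
    for u, vs in adj.items():
        for v in vs:
            radj[v].append(u)
    comp = {}
    for s in reversed(order):            # second pass on the reversed graph
        if s in comp:
            continue
        comp[s] = s
        stack = [s]
        while stack:
            u = stack.pop()
            for v in radj[u]:
                if v not in comp:
                    comp[v] = s
                    stack.append(v)
    return comp

def candidate_roots(adj):
    """Representatives of SCCs with in-degree 0 in the condensation graph."""
    comp = strongly_connected_components(adj)
    has_incoming = {comp[v] for u, vs in adj.items() for v in vs
                    if comp[u] != comp[v]}
    return sorted({c for c in comp.values() if c not in has_incoming})

def add_working_root(adj, root_l="00_L", root_r="00_R"):
    """Add the artificial edge root_l -> root_r and link root_r to every
    candidate root (assumed: one arbitrary representative per source SCC)."""
    new_adj = {u: list(vs) for u, vs in adj.items()}
    new_adj[root_l] = [root_r]
    new_adj[root_r] = candidate_roots(adj)
    return new_adj

def reachable(adj, start):
    """All nodes reachable from `start` by an iterative depth-first search."""
    seen, stack = {start}, [start]
    while stack:
        for v in adj.get(stack.pop(), []):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

# Hypothetical toy graph: the cycle a -> b -> c -> a and the tip x are both
# sources of the condensation graph, so neither alone reaches every node.
toy = {"a": ["b"], "b": ["c"], "c": ["a", "d"], "x": ["d"], "d": ["e"]}
rooted = add_working_root(toy)
print(len(reachable(rooted, "00_L")))
```

In this toy graph no original node reaches all others, while the added `00_L` does, which mirrors the role of the working root in the caption.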

Theorems & Definitions (17)

Corollary 1
Corollary 2
Definition 1
Lemma 1
Theorem 1
proof
Lemma 2
proof
Theorem 2
proof
...and 7 more
