When is String Reconstruction using de Bruijn Graphs Hard?

Ben Bals, Sebastiaan van Krieken, Solon P. Pissis, Leen Stougie, Hilde Verbeek

TL;DR

This work parametrizes the problem so that the running time reflects the quality of the available domain knowledge, and gives combinatorial insights that lead to exponential-time improvements over the state of the art.

Abstract

The reduction of the fragment assembly problem to (variations of) the classical Eulerian trail problem [Pevzner et al., PNAS 2001] has led to remarkable progress in genome assembly. This reduction employs the notion of a de Bruijn graph $G=(V,E)$ of order $k$ over an alphabet $\Sigma$. A single Eulerian trail in $G$ represents a candidate genome reconstruction. Bernardini et al. have also introduced the complementary idea in data privacy [ALENEX 2020] based on $z$-anonymity. The pressing question is: How hard is it to reconstruct a best string from a de Bruijn graph given a function that models domain knowledge? Such a function maps every length-$k$ string to an interval of positions where it may occur in the reconstructed string. By the above reduction to de Bruijn graphs, this function translates into a function $c$ mapping every edge to an interval where it may occur in an Eulerian trail. This gives rise to the following basic problem on graphs: Given an instance $(G,c)$, can we efficiently compute an Eulerian trail respecting $c$? Hannenhalli et al. [CABIOS 1996] formalized this problem and showed that it is NP-complete. We focus on parametrization aiming to capture the quality of our domain knowledge in the complexity. Ben-Dor et al. developed an algorithm to solve the problem on de Bruijn graphs in $O(m \cdot w^{1.5} 4^{w})$ time, where $m=|E|$ and $w$ is the maximum interval length over all edges. Bumpus and Meeks [Algorithmica 2023] rediscovered the same algorithm on temporal graphs, highlighting the relevance of this problem in other contexts. We give combinatorial insights that lead to exponential-time improvements over the state of the art. For the important class of de Bruijn graphs, we develop an algorithm parametrized by $w (\log w+1) /(k-1)$. Our improved algorithm shows that it is enough for the range of positions to be small relative to $k$.
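To make the problem statement concrete, the following is a minimal brute-force sketch of the question "does $(G,c)$ admit an Eulerian trail respecting $c$?". It is not the algorithm of the paper, nor that of Ben-Dor et al.; it is plain backtracking that takes exponential time in the worst case. The function name `interval_eulerian_trail`, the edge-list representation, and the tiny example instance are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple

Edge = Tuple[str, str]        # (tail, head) of a directed edge
Interval = Tuple[int, int]    # inclusive range of allowed positions in the trail


def interval_eulerian_trail(
    edges: List[Edge], intervals: List[Interval], start: str
) -> Optional[List[int]]:
    """Return edge indices in trail order such that the edge used at
    step t (1-based) satisfies lo <= t <= hi, or None if no such trail
    starting at `start` exists. Hypothetical helper, brute force only."""
    m = len(edges)
    used = [False] * m
    trail: List[int] = []

    def extend(node: str, step: int) -> bool:
        if step > m:                      # every edge has been placed
            return True
        for idx, (tail, head) in enumerate(edges):
            lo, hi = intervals[idx]
            if not used[idx] and tail == node and lo <= step <= hi:
                used[idx] = True
                trail.append(idx)
                if extend(head, step + 1):
                    return True
                trail.pop()               # undo and try the next edge
                used[idx] = False
        return False

    return trail if extend(start, 1) else None


# Assumed toy instance: two 2-cycles sharing node "a".
edges = [("a", "b"), ("b", "a"), ("a", "c"), ("c", "a")]
# The intervals force the cycle through "c" to be traversed first.
intervals = [(3, 4), (3, 4), (1, 2), (1, 2)]
print(interval_eulerian_trail(edges, intervals, "a"))
```

Tightening every interval to `(1, 1)` makes the instance infeasible, and the function then returns `None`. The width of the intervals plays the role of the parameter $w$ in the abstract: the narrower they are, the fewer edges are eligible at each step.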

Paper Structure

This paper contains 20 sections, 28 theorems, 7 equations, 6 figures.

Key Result

Theorem 1.1

The $\mathrm{diET}$ problem is $\mathsf{NP}$-complete, even on de Bruijn graphs with $|\Sigma|=2$.

Figures (6)

  • Figure 1: The de Bruijn multigraph $G_{\mathcal{S},k}=(V,E)$ (left), the set of node-distinct Eulerian trails from $s$ to $t$ (middle), and the corresponding set of string reconstructions (right) for the string collection $\mathcal{S}= \texttt{001}, \texttt{010}, \texttt{011}, \texttt{011}, \texttt{100}, \texttt{101}, \texttt{110}, \texttt{110}$, over the alphabet $\Sigma=\{\texttt{0},\texttt{1}\}$, and $k=3$.
  • Figure 2: On the left is the input graph $G$: every edge is labeled with the time steps at which it is available. On the right is a table illustrating the interval every edge is available: $abcdac$ is an Eulerian trail; the edge usages corresponding to this trail are indicated with red vertical lines.
  • Figure 3: A directed graph $G$ (left) and the corresponding part of graph $G'$ (right). The nodes in $G'$ that correspond to nodes in $G$ are colored black; the nodes in $G'$ that correspond to edges in $G$ are colored red. Inner edges in $G'$ are colored gray and squiggly lines indicate paths of length $2\ell +4$.
  • Figure 4: The zero-cost cycle $Q_1 \subseteq E'$ constructed from a Hamiltonian path in $G$.
  • Figure 5: The input de Bruijn graph $G$ (on the left) and the construction of graph $H$ (on the right) for $k=3$, $|\Sigma|=2$, $w=2$, and thus $w+k-1=4$. For instance, for $t=2$, $\ell=2$, and $i=2$, we add node $(t, \alpha)=(2,\texttt{ATAA})$ to layer $t=2$ in $H$, because the node of $G$ corresponding to $\alpha[(i-1)(k-1)+1\mathinner{.\,.} i(k-1)]=\texttt{AA}$ has an incoming edge labeled $\alpha[i(k-1)]=\texttt{A}$ available at time step $t-w+(i-1)(k-1)=2$. A $c$-respecting Eulerian trail in $G$ is indicated in red in $H$.
  • ...and 1 more figure
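The setting of Figure 1 can be reproduced with a short sketch: build the de Bruijn multigraph of the string collection (nodes are length-$(k-1)$ strings, each length-$k$ string is an edge from its prefix to its suffix) and spell out every Eulerian trail. The function name `reconstructions` is an assumption for illustration, and the enumeration is exhaustive backtracking, so it is only suitable for tiny inputs. Parallel edges are stored as multiplicities, so trails that differ only in which copy of a parallel edge they use are not counted twice.

```python
from collections import Counter, defaultdict
from typing import List, Set


def reconstructions(kmers: List[str]) -> Set[str]:
    """Spell every Eulerian trail of the de Bruijn multigraph of `kmers`.
    Hypothetical helper; exhaustive search, exponential in general."""
    out_edges = defaultdict(Counter)   # node -> multiset of successor nodes
    balance = Counter()                # out-degree minus in-degree per node
    for kmer in kmers:
        tail, head = kmer[:-1], kmer[1:]
        out_edges[tail][head] += 1
        balance[tail] += 1
        balance[head] -= 1

    # A trail must start at the node with one surplus outgoing edge;
    # if the graph is balanced, any node may serve as the start.
    surplus = [node for node, b in balance.items() if b == 1]
    candidates = surplus if surplus else list(balance)

    results: Set[str] = set()

    def walk(node: str, text: str, remaining: int) -> None:
        if remaining == 0:             # all edges used: a full reconstruction
            results.add(text)
            return
        for nxt in list(out_edges[node]):
            if out_edges[node][nxt] > 0:
                out_edges[node][nxt] -= 1
                walk(nxt, text + nxt[-1], remaining - 1)
                out_edges[node][nxt] += 1   # restore the edge for other branches

    for start in candidates:
        walk(start, start, len(kmers))
    return results


# The collection from Figure 1, with k = 3 over the alphabet {0, 1}.
collection = ["001", "010", "011", "011", "100", "101", "110", "110"]
for text in sorted(reconstructions(collection)):
    print(text)
```

Every printed string has length $|\mathcal{S}| + k - 1 = 10$ and contains exactly the length-3 substrings of the collection, with multiplicity.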

Theorems & Definitions (54)

  • Theorem 1.1
  • Theorem 1.1: Theorem 11 of Ben-Dor et al. [J. Comput. Biol. 2002], Theorem 7 of Bumpus and Meeks [Algorithmica 2023]
  • Theorem 1.1
  • Corollary 1.0
  • Corollary 1.0
  • Theorem 1.1
  • Definition 2.1
  • Definition 2.2
  • Theorem 3.1
  • Lemma 3.2
  • ...and 44 more