Elastic-Degenerate String Comparison

Esteban Gabory, Moses Njagi Mwaniki, Nadia Pisanti, Solon P. Pissis, Jakub Radoszewski, Michelle Sweering, Wiktor Zuba

TL;DR

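The paper asks how fast one can check whether the languages represented by two elastic-degenerate (ED) strings have a nonempty intersection. It gives conditional lower bounds and several algorithms for this problem, and shows that the problem is NP-complete when the two unary ED strings are given in a compact representation.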

Abstract

An elastic-degenerate (ED) string $T$ is a sequence of $n$ sets $T[1],\ldots,T[n]$ containing $m$ strings in total whose cumulative length is $N$. We call $n$, $m$, and $N$ the length, the cardinality and the size of $T$, respectively. The language of $T$ is defined as $L(T)=\{S_1 \cdots S_n\,:\,S_i \in T[i] \text{ for all } i\in[1,n]\}$. ED strings have been introduced to represent a set of closely-related DNA sequences, also known as a pangenome. The basic question we investigate here is: Given two ED strings, how fast can we check whether the two languages they represent have a nonempty intersection? We call the underlying problem the ED String Intersection (EDSI) problem.

For two ED strings $T_1$ and $T_2$ of lengths $n_1$ and $n_2$, cardinalities $m_1$ and $m_2$, and sizes $N_1$ and $N_2$, respectively, we show the following:

- There is no $O((N_1N_2)^{1-\varepsilon})$-time algorithm, for any constant $\varepsilon>0$, for EDSI even when $T_1$ and $T_2$ are over a binary alphabet, unless the Strong Exponential-Time Hypothesis is false.
- There is no combinatorial $O((N_1+N_2)^{1.2-\varepsilon}f(n_1,n_2))$-time algorithm, for any constant $\varepsilon>0$ and any function $f$, for EDSI even when $T_1$ and $T_2$ are over a binary alphabet, unless the Boolean Matrix Multiplication conjecture is false.
- An $O(N_1\log N_1\log n_1+N_2\log N_2\log n_2)$-time algorithm for outputting a compact (RLE) representation of the intersection language of two unary ED strings. In the case when $T_1$ and $T_2$ are given in a compact representation, we show that the problem is NP-complete.
- An $O(N_1m_2+N_2m_1)$-time algorithm for EDSI.
- An $\tilde{O}(N_1^{\omega-1}n_2+N_2^{\omega-1}n_1)$-time algorithm for EDSI, where $\omega$ is the exponent of matrix multiplication; the $\tilde{O}$ notation suppresses factors that are polylogarithmic in the input size.
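To make the definitions of $L(T)$ and EDSI concrete, here is a minimal Python sketch. It is an illustration only, not any of the algorithms from the paper, and no claim is made that its running time matches the bounds above. An ED string is modelled as a list of lists of strings; the function names `language` and `ed_intersect` are hypothetical. The example instance assumes the sets $T_1=\{\texttt{AC},\texttt{A},\texttt{TGCT}\}\cdot\{\varepsilon,\texttt{CA}\}$ and $T_2=\{\texttt{T},\varepsilon\}\cdot\{\texttt{GCA},\texttt{AC}\}$, which is our reading of the Figure 4 caption.

```python
from itertools import product


def language(ed):
    """Enumerate L(T) by brute force: pick one string from every set.

    Exponential in the length of the ED string; only for tiny examples.
    """
    return {"".join(choice) for choice in product(*ed)}


def ed_intersect(t1, t2):
    """Return True if L(t1) and L(t2) share at least one string.

    Searches states (i, j, ahead, rest): i sets of t1 and j sets of t2
    have been read, and side `ahead` has spelled `rest` more than the
    other side. The lagging side always reads next, so `rest` is always
    a suffix of a single string of the input.
    """
    eds = (t1, t2)
    start = (0, 0, 0, "")
    seen, stack = {start}, [start]
    while stack:
        i, j, ahead, rest = stack.pop()
        if i == len(t1) and j == len(t2) and not rest:
            return True  # both ED strings fully read, spelling the same string
        pos = [i, j]
        if rest:
            mover = 1 - ahead  # the lagging side must catch up
        else:
            mover = 0 if i < len(t1) else 1  # tie: read whichever side has sets left
        if pos[mover] == len(eds[mover]):
            continue  # lagging side has nothing left to read: dead end
        for s in eds[mover][pos[mover]]:
            if s.startswith(rest):
                # mover catches up and possibly overtakes
                new_ahead, new_rest = mover, s[len(rest):]
            elif rest.startswith(s):
                # mover is still behind after reading s
                new_ahead, new_rest = ahead, rest[len(s):]
            else:
                continue  # mismatch between the two spellings
            if not new_rest:
                new_ahead = 0  # normalise the tie state
            new_pos = pos.copy()
            new_pos[mover] += 1
            state = (new_pos[0], new_pos[1], new_ahead, new_rest)
            if state not in seen:
                seen.add(state)
                stack.append(state)
    return False


# Assumed reading of the Figure 4 instance ("" stands for the empty string).
T1 = [["AC", "A", "TGCT"], ["", "CA"]]
T2 = [["T", ""], ["GCA", "AC"]]
print(language(T1) & language(T2))  # {'AC'}
print(ed_intersect(T1, T2))         # True
```

The brute-force `language` function follows the definition directly, while `ed_intersect` avoids enumerating both languages by tracking only how far one side is ahead of the other, in the spirit of the intersection automaton described in Figure 5.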

Paper Structure

This paper contains 20 sections, 26 theorems, 6 equations, 9 figures, 1 algorithm.

Key Result

Theorem 2.2

Given any set $V=\{v^1,\ldots, v^k\}$ of $k$ binary vectors of length $d$, we can construct in linear time two ED strings $T_1$ and $T_2$ over a binary alphabet such that:

Figures (9)

  • Figure 1: An example of an MSA and its corresponding (non-unique) ED string $T$ of length $n=7$, cardinality $m=11$ and size $N=20$, and the compacted NFA for $T$. The compacted NFA can be seen as a special case of an edge-labeled directed acyclic graph.
  • Figure 2: An example of two ED strings $T_1$ and $T_2$ with their parameters and the intersection of their languages. In this instance, we see that $\mathcal{L}(T_1)$ and $\mathcal{L}(T_2)$ have a nonempty intersection.
  • Figure 3: On the left: an ED string; on the right: the corresponding path-automaton.
  • Figure 4: The path-automata $A_1$ and $A_2$ for ED strings $T_1= \left\{\texttt{AC}, \texttt{A}, \texttt{TGCT}\right\} \cdot \left\{\varepsilon, \texttt{CA}\right\}$ and $T_2= \left\{\texttt{T}, \varepsilon\right\} \cdot \left\{\texttt{GCA}, \texttt{AC}\right\}$. The filled black nodes are explicit states, while the orange empty nodes are implicit states.
  • Figure 5: Intersection automaton for $T_1$ and $T_2$ as in Figure 4, where the string $\texttt{AC}$ in $\mathcal{L}(T_1) \cap \mathcal{L}(T_2)$, which determines a positive answer to EDSI, can be spelled along the path from the starting state to the accepting state. The path-automata $A_1$ and $A_2$ are shown on the left and on the top, respectively, and nodes of the intersection automaton are arranged along dotted lines that correspond to copies of the layout of $A_1$ and $A_2$, to simplify the understanding of $G$. The dashed edges of the intersection automaton correspond to $\varepsilon$-transitions (namely, transitions such that no letter is read when traversed), while the solid edges correspond to the other extended transitions.
  • ...and 4 more figures

Theorems & Definitions (57)

  • Conjecture 2.1: OV conjecture DBLP:journals/tcs/Williams05
  • Theorem 2.2
  • proof
  • Example 2.3
  • Corollary 2.4
  • Conjecture 2.5: BMM conjecture DBLP:conf/focs/AbboudW14
  • Theorem 2.6
  • proof
  • Corollary 2.8
  • Theorem 3.1
  • ...and 47 more