Elastic-Degenerate String Comparison
Esteban Gabory, Moses Njagi Mwaniki, Nadia Pisanti, Solon P. Pissis, Jakub Radoszewski, Michelle Sweering, Wiktor Zuba
TL;DR
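Given two elastic-degenerate (ED) strings, how fast can one check whether the languages they represent have a nonempty intersection? The paper gives conditional lower bounds and algorithms for this ED String Intersection (EDSI) problem, and shows that it is NP-complete when two unary ED strings are given in a compact representation.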
Abstract
An elastic-degenerate (ED) string $T$ is a sequence of $n$ sets $T[1],\ldots,T[n]$ containing $m$ strings in total whose cumulative length is $N$. We call $n$, $m$, and $N$ the length, the cardinality and the size of $T$, respectively. The language of $T$ is defined as $L(T)=\{S_1 \cdots S_n\,:\,S_i \in T[i]$ for all $i\in[1,n]\}$. ED strings have been introduced to represent a set of closely-related DNA sequences, also known as a pangenome. The basic question we investigate here is: Given two ED strings, how fast can we check whether the two languages they represent have a nonempty intersection? We call the underlying problem the ED String Intersection (EDSI) problem.

For two ED strings $T_1$ and $T_2$ of lengths $n_1$ and $n_2$, cardinalities $m_1$ and $m_2$, and sizes $N_1$ and $N_2$, respectively, we show the following:
- There is no $O((N_1N_2)^{1-ε})$-time algorithm, for any constant $ε>0$, for EDSI even when $T_1$ and $T_2$ are over a binary alphabet, unless the Strong Exponential-Time Hypothesis is false.
- There is no combinatorial $O((N_1+N_2)^{1.2-ε}f(n_1,n_2))$-time algorithm, for any constant $ε>0$ and any function $f$, for EDSI even when $T_1$ and $T_2$ are over a binary alphabet, unless the Boolean Matrix Multiplication conjecture is false.
- An $O(N_1\log N_1\log n_1+N_2\log N_2\log n_2)$-time algorithm for outputting a compact (RLE) representation of the intersection language of two unary ED strings. In the case when $T_1$ and $T_2$ are given in a compact representation, we show that the problem is NP-complete.
- An $O(N_1m_2+N_2m_1)$-time algorithm for EDSI.
- An $\tilde{O}(N_1^{ω-1}n_2+N_2^{ω-1}n_1)$-time algorithm for EDSI, where $ω$ is the exponent of matrix multiplication; the $\tilde{O}$ notation suppresses factors that are polylogarithmic in the input size.
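
To make the definitions concrete, the following is a minimal Python sketch of an ED string, its parameters $n$, $m$, $N$, and the EDSI question. It is not any of the algorithms listed above: it enumerates both languages explicitly and therefore takes exponential time in the worst case. The representation (a list of sets of Python strings) and the names `params`, `language`, and `edsi_bruteforce` are assumptions made for illustration, as is the choice to let the empty string contribute 0 to the size.

```python
from itertools import product


def params(ed):
    """Return (n, m, N): length, cardinality and size of an ED string.

    `ed` is a list of sets of strings (an illustrative representation).
    Assumption: the empty string contributes 0 to the size N.
    """
    n = len(ed)
    m = sum(len(segment) for segment in ed)
    size = sum(len(s) for segment in ed for s in segment)
    return n, m, size


def language(ed):
    """Enumerate L(T): every concatenation S_1...S_n with S_i in T[i]."""
    return {"".join(choice) for choice in product(*ed)}


def edsi_bruteforce(t1, t2):
    """Decide EDSI by explicit enumeration (exponential, for illustration only)."""
    return bool(language(t1) & language(t2))


# Two small ED strings over the DNA alphabet.
T1 = [{"A", "AC"}, {"", "G"}, {"T"}]
T2 = [{"AC"}, {"GT", "T"}]

print(params(T1))                # (3, 5, 5)
print(sorted(language(T1)))      # ['ACGT', 'ACT', 'AGT', 'AT']
print(edsi_bruteforce(T1, T2))   # True: both languages contain 'ACT' and 'ACGT'
```

Even in this tiny example, $L(T_1)$ has four strings while $T_1$ has size 5; in general the language can be exponentially larger than $N$, which is why the bounds above are stated in terms of $n$, $m$, and $N$ rather than the language size.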
