
NP-Completeness for the Space-Optimality of Double-Array Tries

Hideo Bannai, Keisuke Goto, Shunsuke Kanda, Dominik Köppl

TL;DR

This paper studies the decision problem of whether a double-array trie of a given size exists. It draws a connection to the sparse matrix compression problem, which shows that the problem is NP-complete for alphabet sizes linear in the number of nodes, and it gives a reduction from the restricted directed Hamiltonian path problem, which yields NP-completeness even for logarithmic-sized alphabets.

Abstract

Indexing a set of strings for prefix search or membership queries is a fundamental task with many applications, such as information retrieval and database systems. A classic abstract data type for modelling such an index is a trie. Due to the fundamental nature of this problem, it has sparked much interest, leading to a variety of trie implementations with different characteristics. A trie implementation that is widely used in practice is the double-array (trie), which consists of merely two integer arrays. While a traversal takes constant time per node visit, the space needed, measured in computer words, can be as large as the product of the number of nodes and the alphabet size. Although several heuristics have been proposed for lowering the space requirements, we are unaware of any theoretical guarantees. In this paper, we study the decision problem of whether there exists a double-array of a given size. To this end, we first draw a connection to the sparse matrix compression problem, which makes our problem NP-complete for alphabet sizes linear in the number of nodes. We further propose a reduction from the restricted directed Hamiltonian path problem, leading to NP-completeness even for logarithmic-sized alphabets.
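
To make the object of study concrete, the following is a minimal Python sketch of a double-array, not the construction used in the paper. It assumes the conventional array names BASE and CHECK, assumes that `$` does not occur in the input strings and can serve as an end marker, and places nodes with a simple first-fit heuristic. The function names `build_trie`, `build_double_array`, and `contains` are hypothetical and chosen for illustration only.

```python
from collections import deque


def build_trie(words):
    """Plain dict-of-dicts trie; '$' marks the end of a word (assumed absent from the input)."""
    root = {}
    for w in words:
        node = root
        for ch in w + "$":
            node = node.setdefault(ch, {})
    return root


def build_double_array(words):
    """Place all trie nodes into two integer arrays using first-fit (a heuristic, not optimal)."""
    root = build_trie(words)
    alphabet = sorted({ch for w in words for ch in w} | {"$"})
    code = {ch: i + 1 for i, ch in enumerate(alphabet)}  # character codes 1..sigma

    base, check, used = [0], [-1], [True]  # slot 0 holds the root

    def ensure(n):
        # Grow all three arrays so that indices 0..n-1 exist.
        while len(base) < n:
            base.append(0)
            check.append(-1)
            used.append(False)

    queue = deque([(root, 0)])
    while queue:
        node, s = queue.popleft()
        if not node:  # leaf: nothing to place
            continue
        codes = [code[ch] for ch in node]
        b = 0
        while True:  # first-fit: smallest offset where every child slot is free
            ensure(b + max(codes) + 1)
            if all(not used[b + c] for c in codes):
                break
            b += 1
        base[s] = b
        for ch, child in node.items():
            t = b + code[ch]
            used[t] = True
            check[t] = s  # slot t belongs to a child of the node stored at s
            queue.append((child, t))
    return base, check, code


def contains(base, check, code, word):
    """Membership query: one array lookup and one comparison per character."""
    s = 0
    for ch in word + "$":
        if ch not in code:
            return False
        t = base[s] + code[ch]
        if t >= len(check) or check[t] != s:
            return False
        s = t
    return True


base, check, code = build_double_array(["bad", "bat", "cat"])
print(len(base), contains(base, check, code, "bat"), contains(base, check, code, "ba"))
```

The length of the two arrays depends on the offsets chosen for each node. In this sketch, the decision problem from the abstract corresponds to asking whether some choice of offsets keeps that length within a given bound; first-fit gives no such guarantee.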

Paper Structure

This paper contains 8 sections, 6 theorems, 4 equations, and 3 figures.

Key Result

Lemma 2

For the case $\sigma \in \{2, 3\}$, the problem can be solved in polynomial time.

Figures (3)

  • Figure 1: The structure of string $S_{i_k}$ in Eq. \ref{eqSik} connected with $C_{i_k}$ to its left and $C_{i_{k+1}}$ to its right, where $c(i_k,j) = i_{k+1}$. The leftmost blank box can be $v_{i_k}$ or $\ddagger$ (if $i_k = 1$). The rightmost blank box can be $w_{c(i_k,j)}$ or $\$$ (if $i_k = n$).
  • Figure 2: Constructing a shortest superstring for the input string set $\mathcal{T}'$ with the same building blocks as in Figure \ref{fig:structure}. The patch strings fill up all wildcard symbols.
  • Figure 3: Non-sub-block-aligned overlapping of $\alpha$ and $\overline{\alpha}$ to fill the leftmost wildcard. The figure assumes that there are no wildcards preceding (2). The wildcard (corresponding to $0$) in (2) may overlap with previously filled blocks. Since the $1$s are mapped to IDs that are distinct for each string, they must not overlap. For example, the overlap of (2) and (1) tries to fill the second $0$ of (2) with another $\alpha$ (1), but this is obstructed by the last $1$ of (2) and the second-to-last $1$ of (2). Similarly, the overlap of (2) and (3) also tries to fill the second $0$ of (2) with $\overline{\alpha}$ (3). The overlap of (3) and (4) tries to fill the first $0$ of (3) $\overline{\alpha}$ with $\overline{\alpha}$. Thus, the only way to fill the first $0$ in $\alpha$ or the first $0$ in $\overline{\alpha}$ is to align $\alpha$ and $\overline{\alpha}$.

Theorems & Definitions (7)

  • Example 1
  • Lemma 2
  • Theorem 3
  • Theorem 4
  • Lemma 5: Lemma 1 of Gallant et al. [gallant1980finding]
  • Theorem 6: Theorem 1 of Gallant et al. [gallant1980finding]
  • Lemma 7