Construction of Sparse Suffix Trees and LCE Indexes in Optimal Time and Space

Dmitry Kosolobov, Nikita Sivukhin

TL;DR

The paper addresses the efficient construction of small-space string indexes, namely sparse suffix trees (SSTs) and LCE indexes, for readonly strings, using a deterministic locally consistent parsing framework built on $\tau$-partitioning sets. It combines a Cole–Vishkin style partitioning method with Jeż's recompression to produce $\tau$-partitioning sets of size $\mathcal{O}(b)$ with $\tau = n/b$, using $\mathcal{O}(b)$ space on top of the input. The core results show that, for $\tau$ in a broad range (including $\tau \ge 4$ up to $\mathcal{O}(n/\log^2 n)$), one can deterministically construct SSTs and LCE indexes in $\mathcal{O}(n \log_b n)$ time using $\mathcal{O}(b)$ space, which subsumes and improves prior small-space deterministic constructions; for $b \ge n^{\varepsilon}$ with constant $\varepsilon > 0$ the construction time is linear. The result is a general framework with tight time-space trade-offs for small-space string indexing of readonly inputs.
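
The step from the general time bound to linear time is short arithmetic. It is spelled out here as a worked check, not as a quotation from the paper: if $b \ge n^{\varepsilon}$ for a constant $\varepsilon > 0$, then

$$\log_b n = \frac{\log n}{\log b} \le \frac{\log n}{\varepsilon \log n} = \frac{1}{\varepsilon}, \qquad \text{so} \qquad \mathcal{O}(n \log_b n) \subseteq \mathcal{O}\!\left(\tfrac{1}{\varepsilon}\, n\right) = \mathcal{O}(n).$$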

Abstract

The notions of synchronizing and partitioning sets are recently introduced variants of locally consistent parsings with great potential in problem-solving. In this paper we propose a deterministic algorithm that constructs for a given readonly string of length $n$ over the alphabet $\{0,1,\ldots,n^{\mathcal{O}(1)}\}$ a variant of $\tau$-partitioning set with size $\mathcal{O}(b)$ and $\tau = \frac{n}{b}$ using $\mathcal{O}(b)$ space and $\mathcal{O}(\frac{1}{\varepsilon} n)$ time provided $b \ge n^{\varepsilon}$, for $\varepsilon > 0$. As a corollary, for $b \ge n^{\varepsilon}$ and constant $\varepsilon > 0$, we obtain linear construction algorithms with $\mathcal{O}(b)$ space on top of the string for two major small-space indexes: a sparse suffix tree, which is a compacted trie built on $b$ chosen suffixes of the string, and a longest common extension (LCE) index, which occupies $\mathcal{O}(b)$ space and allows us to compute the longest common prefix for any pair of substrings in $\mathcal{O}(n/b)$ time. For both, the $\mathcal{O}(b)$ construction storage is asymptotically optimal since the tree itself takes $\mathcal{O}(b)$ space and any LCE index with $\mathcal{O}(n/b)$ query time must occupy at least $\mathcal{O}(b)$ space by a known trade-off (at least for $b \ge \Omega(n / \log n)$). In case of arbitrary $b \ge \Omega(\log^2 n)$, we present construction algorithms for the partitioning set, sparse suffix tree, and LCE index with $\mathcal{O}(n\log_b n)$ running time and $\mathcal{O}(b)$ space, thus also improving the state of the art.
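
To make the two indexed objects concrete, the following is a minimal Python sketch of what they answer, assuming a plain in-memory string. It is a naive baseline and not the paper's construction: the function names `lce`, `sparse_suffix_array`, and `sparse_lcp` are hypothetical, the LCE query scans characters one by one, and the sorting step copies suffixes, so it meets neither the $\mathcal{O}(n/b)$ query time nor the $\mathcal{O}(b)$ space bound. The sorted list of chosen suffixes and the LCP values between neighbours carry the ordering and branching information that a compacted trie on those suffixes represents.

```python
def lce(s, i, j):
    """Naive LCE query: length of the longest common prefix of s[i:] and s[j:]."""
    n = len(s)
    k = 0
    # Compare character by character until a mismatch or the end of the string.
    while i + k < n and j + k < n and s[i + k] == s[j + k]:
        k += 1
    return k


def sparse_suffix_array(s, positions):
    """Sort the chosen suffix start positions lexicographically.

    Naive on purpose: slicing copies each suffix, so this is far from
    the O(b) working space that the paper targets.
    """
    return sorted(positions, key=lambda p: s[p:])


def sparse_lcp(s, ssa):
    """LCP values between lexicographically adjacent chosen suffixes."""
    return [lce(s, ssa[k - 1], ssa[k]) for k in range(1, len(ssa))]


if __name__ == "__main__":
    text = "abracadabra"
    chosen = [0, 3, 7, 10]           # b = 4 chosen suffixes (illustrative)
    ssa = sparse_suffix_array(text, chosen)
    print(ssa)                       # [10, 7, 0, 3]
    print(sparse_lcp(text, ssa))     # [1, 4, 1]
    print(lce(text, 0, 7))           # 4
```

In this example the chosen suffixes are `a`, `abra`, `abracadabra`, and `acadabra` in sorted order, and the LCP value 4 between `abra` and `abracadabra` corresponds to a branching point at string depth 4 in the sparse suffix tree.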

Paper Structure

This paper contains 26 sections, 26 theorems, 1 equation, 2 figures, 1 table.

Key Result

Lemma 1

For any $\tau' \ge \tau$, every $\tau$-partitioning set is also $\tau'$-partitioning.

Figures (2)

  • Figure 1: The $k$th phase. The heights of the dashed lines over $j_h$ are equal to $v_h$. Encircled positions are put into $S_k$: they are local minima of $v_h$, or are at the "boundaries" of all-$R$ regions, or form a gap of length ${>}2^k$. In the figure $R(j_{16}),\ldots, R(j_{20})$ hold and $R(j_{21})$ does not hold.
  • Figure 2: The scheme generating $a_p$ via $\mathop{\mathsf{vbit}}$ reductions. If a node $\hat{t}$ has ingoing edges labeled with $\tilde{t}, \tilde{t}_1, \tilde{t}_2, \ldots, \tilde{t}_r$ (from left to right), then $\hat{t}$ encodes a tuple $\langle \tilde{w}_1, \tilde{w}_2, \ldots, \tilde{w}_\ell \rangle$ such that, for $j \in [1..r]$, $\tilde{w}_j = \mathop{\mathsf{vbit}}(\tilde{t}, \tilde{t}_j)$ and, for $j \in (r..\ell]$, $\tilde{w}_j = \infty$. In the figure, the numbers $t, t_1, t_2, \ldots, t_{m+5}$ correspond to consecutive positions $p, p_1, p_2, \ldots, p_{m+5}$ in the set $S'$, respectively. By looking at which of the ingoing edges are present and which are not, one can deduce that here we have $S' \cap (p..p{+}\tau/2^5] = \{p_1, \ldots, p_m\}$, $S' \cap (p_1..p_1{+}\tau/2^5] = \{p_2, \ldots, p_m\}$, $S' \cap (p_2..p_2{+}\tau/2^5] = \{p_3, \ldots, p_m, p_{m+1}\}$, $S' \cap (p_m..p_m{+}\tau/2^5] = \{p_{m+1}\}$, $S' \cap (p_{m+1}..p_{m+1}{+}\tau/2^5] = \{p_{m+2}, p_{m+3}\}$, $S' \cap (p_{m+2}..p_{m+2}{+}\tau/2^5] = \{p_{m+3}\}$, $S' \cap (p_{m+3}..p_{m+3}{+}\tau/2^5] = \{p_{m+4}, p_{m+5}\}$.

Theorems & Definitions (26)

  • Lemma 1
  • Theorem 2
  • Lemma 2
  • Theorem 3
  • Theorem 4
  • Lemma 5: see ColeVishkin
  • Lemma 6: see ColeVishkin
  • Lemma 6
  • Lemma 6
  • Lemma 6
  • ...and 16 more