Sparse Suffix and LCP Array: Simple, Direct, Small, and Fast

Lorraine A. K. Ayad; Grigorios Loukides; Solon P. Pissis; Hilde Verbeek


TL;DR

The paper addresses sparse suffix sorting with two direct, simple, and space-efficient algorithms that construct the sparse suffix array (SSA) and the sparse LCP array (SLCP). The first, Main-Algo, builds a hierarchy of LCP groups using Karp-Rabin (KR) fingerprints and outputs the SSA and SLCP through a DFS over this hierarchy, in $O\left(n + \frac{bn}{s}\log s\right)$ time using $s+7b+o(b)$ machine words. To further optimize for practical, sparse inputs, the authors present Parameterized-Algo, which combines two Main-Algo runs with a merge step and takes $O\left(n + \frac{b'n}{b}\log b\right)$ time using $8b+4b'+o(b)$ machine words, where $b'$ counts the suffixes that require extended sorting; when $b' = O(b/\log b)$, the running time is $O(n)$. Both algorithms are Monte Carlo (correct with high probability), produce their output directly without first constructing a sparse suffix tree or an LCE data structure, and perform well in the reported experiments, which makes them a practical baseline for sparse text indexing.
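
The bounds in the TL;DR can be related to those in the abstract by a short calculation. The sketch below assumes the parameter choice $s=b$ for Main-Algo; that choice is inferred here because it reproduces the abstract's $\mathcal{O}(n\log b)$ time and $8b+o(b)$ space, and is not quoted from the paper.

$$
\begin{aligned}
\text{Main-Algo, } s=b:\quad & O\!\left(n + \tfrac{bn}{b}\log b\right) = O(n + n\log b) = O(n\log b)\ \text{time},\\
& b + 7b + o(b) = 8b + o(b)\ \text{machine words};\\
\text{Parameterized-Algo, } b' = O(b/\log b):\quad & O\!\left(n + \tfrac{b'n}{b}\log b\right) = O\!\left(n + \tfrac{n}{\log b}\cdot\log b\right) = O(n)\ \text{time},\\
& 8b + 4b' + o(b) = 8b + O(b/\log b) + o(b) = 8b + o(b)\ \text{machine words}.
\end{aligned}
$$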

Abstract

Sparse suffix sorting is the problem of sorting $b=o(n)$ suffixes of a string of length $n$. Efficient sparse suffix sorting algorithms have existed for more than a decade. Despite the multitude of works and their justified claims for applications in text indexing, the existing algorithms have not been employed by practitioners. Arguably this is because there are no simple, direct, and efficient algorithms for sparse suffix array construction. We provide two new algorithms for constructing the sparse suffix and LCP arrays that are simultaneously simple, direct, small, and fast. In particular, our algorithms are: simple in the sense that they can be implemented using only basic data structures; direct in the sense that the output arrays are not a byproduct of constructing the sparse suffix tree or an LCE data structure; fast in the sense that they run in $\mathcal{O}(n\log b)$ time, in the worst case, or in $\mathcal{O}(n)$ time, when the total number of suffixes with an LCP value greater than $2^{\lfloor \log \frac{n}{b} \rfloor + 1}-1$ is in $\mathcal{O}(b/\log b)$, matching the time of the optimal yet much more complicated algorithms [Gawrychowski and Kociumaka, SODA 2017; Birenzwige et al., SODA 2020]; and small in the sense that they can be implemented using only $8b+o(b)$ machine words. Our algorithms are non-trivial space-efficient adaptations of the Monte Carlo algorithm by I et al. for constructing the sparse suffix tree in $\mathcal{O}(n\log b)$ time [STACS 2014]. We provide extensive experiments to justify our claims on simplicity and on efficiency.
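
To make the problem statement concrete, here is a minimal brute-force sketch of what the two output arrays contain. It is not either of the paper's algorithms: it compares suffixes character by character and is far slower and larger than the bounds above. The function name `naive_sparse_suffix_sort`, the 0-based positions, and the convention that the first SLCP entry is 0 are assumptions made for this illustration.

```python
def naive_sparse_suffix_sort(text, positions):
    """Brute-force reference for the SSA and SLCP of the chosen suffixes.

    text      -- the input string T
    positions -- the b starting positions (0-based) of the suffixes to sort
    Returns (ssa, slcp): ssa lists the positions in lexicographic order of
    their suffixes; slcp[r] is the longest common prefix length of the
    suffixes at ssa[r-1] and ssa[r], with slcp[0] = 0 by convention here.
    """
    # Sort the selected positions by the suffix that starts at each of them.
    ssa = sorted(positions, key=lambda i: text[i:])

    def lcp(i, j):
        # Length of the longest common prefix of text[i:] and text[j:].
        k = 0
        while i + k < len(text) and j + k < len(text) and text[i + k] == text[j + k]:
            k += 1
        return k

    slcp = [0 if r == 0 else lcp(ssa[r - 1], ssa[r]) for r in range(len(ssa))]
    return ssa, slcp
```

For instance, on the text `banana` with positions 1, 3 and 5, the selected suffixes are `anana`, `ana` and `a`, so the sorted order is 5, 3, 1 and consecutive suffixes share 1 and 3 leading characters respectively.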


Paper Structure (24 sections, 10 theorems, 1 equation, 2 figures, 2 tables, 3 algorithms)


Introduction
Motivation.
Our Results.
Preliminaries
Karp-Rabin Fingerprints.
Main Algorithm
Overview.
Computing and Sorting the LCP Groups
Constructing the SSA and SLCP Array
A Full Working Example
Analysis
A Simple Parameterized Algorithm
Main Idea.
Description and Pseudocode.
Time Complexity.
...and 9 more sections

Key Result

Lemma 1

Any string $T\in\Sigma^n$ can be preprocessed in $\mathcal{O}(n)$ time using $s+\mathcal{O}(1)$ machine words, for any $s\in[1,n]$, so that the KR fingerprint of any length-$k$ fragment of $T$ is computed in $\mathcal{O}(\min\{k, n/s\})$ time. I et al. [STACS 2014] claim $\mathcal{O}(s)$ spa
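
The time/space trade-off in Lemma 1 can be illustrated with sampled prefix fingerprints. The following is a sketch under stated assumptions rather than the paper's construction: the class name `SampledKR`, the modulus $2^{61}-1$, and the polynomial form of the fingerprint are choices made here. It stores about $s$ prefix fingerprints, one every $\lceil n/s\rceil$ characters, and answers a query either by scanning the fragment directly (when $k \le \lceil n/s\rceil$) or by extending two stored samples by fewer than $\lceil n/s\rceil$ characters each. The call to `pow` adds a logarithmic factor that the lemma's bound does not include.

```python
import random

P = (1 << 61) - 1  # Mersenne prime modulus (an assumption of this sketch)


class SampledKR:
    """Karp-Rabin fingerprints of fragments from about s stored samples."""

    def __init__(self, text, s, base=None):
        self.text = text
        n = len(text)
        self.step = max(1, -(-n // s))  # ceil(n / s), at least 1
        # A random base is what gives the Monte Carlo guarantee; a fixed
        # base can be passed for reproducible runs.
        self.base = base if base is not None else random.randrange(256, P)
        # samples[q] is the fingerprint of text[0 : q * step].
        self.samples = [0]
        fp = 0
        for i, ch in enumerate(text):
            fp = (fp * self.base + ord(ch)) % P
            if (i + 1) % self.step == 0:
                self.samples.append(fp)

    def _extend(self, fp, lo, hi):
        # Append the characters text[lo:hi] to a running fingerprint.
        for i in range(lo, hi):
            fp = (fp * self.base + ord(self.text[i])) % P
        return fp

    def prefix(self, i):
        # Fingerprint of text[0:i], reading fewer than `step` characters.
        q = i // self.step
        return self._extend(self.samples[q], q * self.step, i)

    def fragment(self, i, k):
        # Fingerprint of text[i : i + k].
        if k <= self.step:
            return self._extend(0, i, i + k)  # direct scan, O(k)
        shift = pow(self.base, k, P)
        return (self.prefix(i + k) - self.prefix(i) * shift) % P
```

Two equal fragments always receive the same fingerprint; two different fragments collide only with small probability over the random choice of the base, which is why algorithms built on such fingerprints are Monte Carlo.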

Figures (2)

Figure 1: Results for sparse instances of FASTQ (top row), AMAZON (middle row) and RANDOM (bottom row). The exact value of $b'$ is on top of the points of the PA curves to highlight the relevance of Theorem \ref{the:param}. Notably in all but two instances we have that $b'=0$. By default, we used $b=0.001\%\cdot n$ when varying $n$; and the whole dataset when varying $b$.
Figure 2: Results for dense instances of FASTQ (top row), AMAZON (middle row) and RANDOM (bottom row). The exact value of $b'$ is on top of the points of the PA curves to highlight the relevance of Theorem \ref{the:param}. As expected, $b'$ decreases with increasing $n$ and increases with increasing $b$. By default, we used $b=6\%\cdot n$ when varying $n$; and the whole dataset when varying $b$.

Theorems & Definitions (18)

Remark 1
Lemma 1 (I et al., STACS 2014)
Theorem 1
Lemma 2
proof
Lemma 3
proof
Theorem 2
Lemma 4
proof
...and 8 more
