Collapsing the Hierarchy of Compressed Data Structures: Suffix Arrays in Optimal Compressed Space

Dominik Kempa; Tomasz Kociumaka

TL;DR

This work establishes δ-SA, a suffix-array-like index operating in δ-optimal space. It answers SA and ISA queries in polylogarithmic time while representing the text compactly as a δ-optimal Lempel–Ziv–based grammar. It delivers a deterministic, compressed-time construction from the LZ77 parsing and introduces δ-compressed string synchronizing sets to reconcile strong query capabilities with compression bounds. The approach collapses the traditional hierarchy of compressed data structures to a single δ-optimal point and immediately improves the space efficiency of a wide range of algorithms relying on SA functionality. Beyond SA/ISA, the framework supports LCE queries, random access, and synchronizing-set computations within the same compressed footprint, with extensions to complex weighted range and modular constraint queries. The methods rely on deterministic restricted recompression, a δ-compressed cover hierarchy, and careful integration of LZ77 parsing with run-length grammar construction, which together yield nearly optimal compressed-space indexing with provably efficient queries for highly repetitive texts.

Abstract

In recent decades, the necessity to process massive amounts of textual data fueled the development of compressed text indexes: data structures efficiently answering queries on a given text while occupying space proportional to the compressed representation of the text. A widespread phenomenon in compressed indexing is that more powerful queries require larger indexes. For example, random access, the most basic query, can be supported in $O(δ\log\frac{n\logσ}{δ\log n})$ space (where $n$ is the text length, $σ$ is the alphabet size, and $δ$ is the text's substring complexity), which is the asymptotically smallest space to represent a string, for all $n$, $σ$, and $δ$ (Kociumaka, Navarro, Prezza; IEEE Trans. Inf. Theory 2023). The other end of the hierarchy is occupied by indexes supporting the powerful suffix array (SA) queries. The currently smallest one takes $O(r\log\frac{n}{r})$ space, where $r\geqδ$ is the number of runs in the BWT of the text (Gagie, Navarro, Prezza; J. ACM 2020). We present a new compressed index that needs only $O(δ\log\frac{n\logσ}{δ\log n})$ space to support SA functionality in $O(\log^{4+ε} n)$ time. This collapses the hierarchy of compressed data structures into a single point: The space required to represent the text is simultaneously sufficient for efficient SA queries. Our result immediately improves the space complexity of dozens of algorithms, which can now be executed in optimal compressed space. In addition, we show how to construct our index in $O(δ\text{ polylog } n)$ time from the LZ77 parsing of the text. For highly repetitive texts, this is up to exponentially faster than the previously best algorithm. To obtain our results, we develop numerous techniques of independent interest, including the first $O(δ\log\frac{n\logσ}{δ\log n})$-size index for LCE queries.
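To make the quantities in the abstract concrete, the following Python sketch computes the substring complexity by brute force, using the standard definition $δ = \max_{k \ge 1} d_k / k$, where $d_k$ is the number of distinct length-$k$ substrings of the text. It also evaluates the expression $δ\log\frac{n\logσ}{δ\log n}$ with the hidden constant ignored. The function names `substring_complexity` and `delta_space_bound` are illustrative and not taken from the paper, the base-2 logarithm is an assumption, and the quadratic-space enumeration is only meant for tiny inputs.

```python
from math import log2


def substring_complexity(text: str) -> float:
    """Brute-force delta = max over k >= 1 of d_k / k,
    where d_k is the number of distinct substrings of length k."""
    n = len(text)
    best = 0.0
    for k in range(1, n + 1):
        distinct = {text[i:i + k] for i in range(n - k + 1)}
        best = max(best, len(distinct) / k)
    return best


def delta_space_bound(n: int, sigma: int, delta: float) -> float:
    """Value of delta * log(n log sigma / (delta log n)).

    Assumes sigma >= 2 and n >= 2 so that every logarithm is positive;
    the O() constant is ignored, so only relative comparisons are meaningful.
    """
    return delta * log2(n * log2(sigma) / (delta * log2(n)))
```

For a highly repetitive text such as a long run of a single letter, `substring_complexity` stays at 1 no matter how long the text grows, which is the regime where the bound is far below $n$.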

Paper Structure (117 sections, 172 theorems, 69 equations, 1 figure)

Introduction
Our Results
Related Work
Organization of the Paper
Preliminaries
Basic definitions
Suffix array
Substring complexity
Lempel–Ziv compression
String Synchronizing Sets
Model of computation
Technical Overview
SA and ISA Queries
The Basic Idea
The Nonperiodic Positions
...and 102 more sections

Key Result

Theorem 1.1

Given the LZ77 parsing of $T \in [0 \mathinner{.\,.} \sigma)^{n}$ and any constant $\epsilon \in (0, 1)$, we can in $\mathcal{O}(\delta \log^7 n)$ time construct a data structure of size $\mathcal{O}(\delta \log \tfrac{n \log \sigma}{\delta \log n})$ (where $\delta$ is the substring complexity of $T$) ...

Figures (1)

Figure 1: A list of all sorted suffixes of $T = \texttt{bbabaababababaababa}$ along with the suffix array.
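The object shown in the figure can be reproduced with a naive construction. The sketch below sorts all suffixes directly and derives the inverse suffix array; it illustrates only the definitions of SA and ISA, not the compressed index from the paper. The 0-based indexing and the function names are assumptions made for this example.

```python
def suffix_array(text: str) -> list[int]:
    """0-based starting positions of all suffixes in lexicographic order.

    Naive: compares whole suffixes, so it is suitable only for short texts.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def inverse_suffix_array(sa: list[int]) -> list[int]:
    """isa[i] is the rank of the suffix starting at position i, so sa[isa[i]] == i."""
    isa = [0] * len(sa)
    for rank, pos in enumerate(sa):
        isa[pos] = rank
    return isa


if __name__ == "__main__":
    T = "bbabaababababaababa"  # the text from Figure 1
    sa = suffix_array(T)
    for rank, pos in enumerate(sa):
        print(rank, pos, T[pos:])
```

Running the script lists the suffixes of the example text in sorted order together with their starting positions, which corresponds to the content of the figure up to the indexing convention.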

Theorems & Definitions (369)

Theorem 1.1: $\delta$-SA
Definition 2.1
Remark 2.2
Proposition 2.3
proof
Definition 2.4: $\tau$-synchronizing set
Remark 2.5
Theorem 2.6
Definition 3.1
Definition 4.1: Cover
...and 359 more
