Collapsing the Hierarchy of Compressed Data Structures: Suffix Arrays in Optimal Compressed Space
Dominik Kempa, Tomasz Kociumaka
TL;DR
This work introduces δ-SA, a suffix-array-like index that occupies δ-optimal space and answers SA and ISA queries in polylogarithmic time, with the text itself represented by a δ-optimal Lempel–Ziv-based grammar. The index can be built deterministically in compressed time from the LZ77 parsing, and the paper introduces δ-compressed string synchronizing sets to reconcile strong query capabilities with compression bounds. The result collapses the traditional hierarchy of compressed data structures to a single δ-optimal point and immediately improves the space efficiency of a wide range of algorithms that rely on SA functionality. Beyond SA/ISA, the framework supports LCE queries, random access, and synchronizing-set computations within the same compressed footprint, with extensions to complex weighted range and modular constraint queries. The methods rely on deterministic restricted recompression, a δ-compressed cover hierarchy, and careful integration of LZ77 parsing with run-length grammar construction, yielding nearly optimal compressed-space indexing for highly repetitive texts.
Abstract
In recent decades, the need to process massive amounts of textual data has fueled the development of compressed text indexes: data structures that efficiently answer queries on a given text while occupying space proportional to the size of the text's compressed representation. A widespread phenomenon in compressed indexing is that more powerful queries require larger indexes. For example, random access, the most basic query, can be supported in $O(δ\log\frac{n\logσ}{δ\log n})$ space (where $n$ is the text length, $σ$ is the alphabet size, and $δ$ is the text's substring complexity), which is the asymptotically smallest space needed to represent a string, for all $n$, $σ$, and $δ$ (Kociumaka, Navarro, Prezza; IEEE Trans. Inf. Theory 2023). The other end of the hierarchy is occupied by indexes supporting the powerful suffix array (SA) queries. The smallest such index currently known takes $O(r\log\frac{n}{r})$ space, where $r\geqδ$ is the number of runs in the BWT of the text (Gagie, Navarro, Prezza; J. ACM 2020). We present a new compressed index that needs only $O(δ\log\frac{n\logσ}{δ\log n})$ space to support SA functionality in $O(\log^{4+ε} n)$ time. This collapses the hierarchy of compressed data structures into a single point: the space required to represent the text is simultaneously sufficient for efficient SA queries. Our result immediately improves the space complexity of dozens of algorithms, which can now be executed in optimal compressed space. In addition, we show how to construct our index in $O(δ\text{ polylog } n)$ time from the LZ77 parsing of the text. For highly repetitive texts, this is up to exponentially faster than the previously best algorithm. To obtain our results, we develop numerous techniques of independent interest, including the first $O(δ\log\frac{n\logσ}{δ\log n})$-size index for LCE queries.
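
To make the queries and the measure $δ$ named in the abstract concrete, here is a brute-force Python sketch of SA, ISA, and LCE queries and of substring complexity on a tiny text. It is only an illustration of the definitions, not the index described in the paper: it works on the uncompressed text and takes polynomial rather than polylogarithmic time. The function names are hypothetical, positions are 0-based, and the sketch assumes the standard definition $δ = \max_{k\geq 1} d_k/k$, where $d_k$ is the number of distinct length-$k$ substrings (the abstract does not spell this definition out).

```python
from fractions import Fraction


def suffix_array(text):
    """Naive SA: start positions of all suffixes in lexicographic order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def inverse_suffix_array(sa):
    """ISA[i] is the lexicographic rank of the suffix starting at position i."""
    isa = [0] * len(sa)
    for rank, pos in enumerate(sa):
        isa[pos] = rank
    return isa


def lce(text, i, j):
    """Longest common extension: length of the common prefix of text[i:] and text[j:]."""
    n = len(text)
    length = 0
    while i + length < n and j + length < n and text[i + length] == text[j + length]:
        length += 1
    return length


def substring_complexity(text):
    """Assumed definition: delta = max over k >= 1 of d_k / k,
    where d_k counts the distinct substrings of length k."""
    n = len(text)
    best = Fraction(0)
    for k in range(1, n + 1):
        d_k = len({text[i:i + k] for i in range(n - k + 1)})
        best = max(best, Fraction(d_k, k))
    return best


text = "banana"
sa = suffix_array(text)          # [5, 3, 1, 0, 4, 2]
isa = inverse_suffix_array(sa)   # [3, 2, 5, 1, 4, 0]
delta = substring_complexity(text)
```

For `"banana"`, the three distinct characters already give $d_1/1 = 3$, and no longer substring length beats that ratio, so the sketch reports $δ = 3$. The compressed index from the abstract answers the same SA and ISA queries without ever materializing `sa` or `isa` in full.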
