Compressed Index with Construction in Compressed Space
Dmitry Kosolobov
TL;DR
The paper introduces a compressed index that occupies $O\left(\delta \log \frac{n}{\delta}\right)$ space and answers pattern searches in $O\left(m + (\mathrm{occ}+1)\log^{\varepsilon} n\right)$ time. The index is built within the same space by a single streaming pass in $O(n \log n)$ expected time, without relying on Karp–Rabin fingerprints. Central to the approach is a hierarchical, jiggle-augmented block structure culminating in the jiggly block tree $J$, which maintains leftmost representatives and supports efficient substring fingerprints and traversal. The method combines deterministic fingerprints with z-fast tries, range reporting, and weighted ancestor structures. The resulting space is almost optimal and the search time matches the best currently known results; the randomization in the construction can be removed by replacing hash tables with deterministic dictionaries, at the cost of a slowdown. Although primarily theoretical due to large constants, the framework offers new insights into the practical design of compressed indexes and potentially improved bounds in packed models.
Abstract
Suppose that we are given a string $s$ of length $n$ over an alphabet $\{0,1,\ldots,n^{O(1)}\}$, and let $\delta$ be a compression measure for $s$ called string complexity. We describe an index on $s$ with $O(\delta\log\frac{n}{\delta})$ space, measured in $O(\log n)$-bit machine words, that can search in $s$ for any string of length $m$ in $O(m + (\mathrm{occ} + 1)\log^{\varepsilon} n)$ time, where $\mathrm{occ}$ is the number of occurrences found and $\varepsilon > 0$ is any fixed constant (the big-O in the space bound hides a factor of $\frac{1}{\varepsilon}$). Crucially, the index can be built within this space in $O(n\log n)$ expected time by one left-to-right pass over the string $s$ in a streaming fashion. The index does not use Karp–Rabin fingerprints, and the randomization in the construction time can be eliminated by using deterministic dictionaries instead of hash tables (with a slowdown). The search time matches the best currently known results, and the space is almost optimal (the known optimum is $O(\delta\log\frac{n}{\delta\alpha})$, where $\alpha = \log_\sigma n$ and $\sigma$ is the alphabet size, and it coincides with $O(\delta\log\frac{n}{\delta})$ when $\delta = O(n / \alpha^2)$). This is the first index that can be constructed within such space and with such time guarantees. To avoid uninteresting marginal cases, all the above bounds are stated for $\delta \ge \Omega(\log\log n)$.
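The abstract does not spell out how $\delta$ is computed. As an illustration only, the sketch below assumes the definition commonly used in the literature for this measure (often called substring complexity): $\delta = \max_{k \ge 1} d_k(s)/k$, where $d_k(s)$ is the number of distinct substrings of $s$ of length $k$. The function name `substring_complexity` is hypothetical, and the brute-force loop is not the paper's construction; it is far slower and uses far more space than the index itself.

```python
from fractions import Fraction


def substring_complexity(s: str) -> Fraction:
    """Brute-force delta(s) = max over k of d_k(s) / k.

    Assumption: d_k(s) is the number of distinct length-k substrings of s.
    This is an illustrative definition, not code from the paper.
    """
    n = len(s)
    best = Fraction(0)
    for k in range(1, n + 1):
        # Collect every distinct substring of length k.
        distinct = {s[i:i + k] for i in range(n - k + 1)}
        best = max(best, Fraction(len(distinct), k))
    return best


if __name__ == "__main__":
    # A highly repetitive string has small delta regardless of its length.
    print(substring_complexity("a" * 16))
    # A string containing all 8 binary substrings of length 3.
    print(substring_complexity("0001011100"))
```

Under this assumed definition, a unary string such as `"a" * 16` has $\delta = 1$, so a space bound of the form $O(\delta \log \frac{n}{\delta})$ becomes $O(\log n)$ words, whereas strings with many distinct short substrings have larger $\delta$.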
