Compressed Index with Construction in Compressed Space

Dmitry Kosolobov

TL;DR

The paper introduces a compressed index that achieves $O\left(\delta \log \frac{n}{\delta}\right)$ space and $O\left(m + (\mathrm{occ}+1)\log^{\varepsilon} n\right)$ search time, with a streaming construction in $O(n \log n)$ time and no reliance on Karp–Rabin fingerprints. Central to the approach is a hierarchical, jiggle-augmented block structure culminating in the jiggly block tree $J$, which maintains leftmost representatives and supports efficient substring fingerprints and traversal. The method uses deterministic fingerprints and sophisticated data structures (z-fast tries, range reporting, weighted ancestors) to achieve near-optimal space and the best-known search-time guarantees in compressed space, while providing a deterministic construction pathway. Although primarily theoretical due to large constants, the framework offers new insights into practical design of compressed indexes and potentially improved bounds in packed models.

Abstract

Suppose that we are given a string $s$ of length $n$ over an alphabet $\{0,1,\ldots,n^{O(1)}\}$ and $\delta$ is a compression measure for $s$ called string complexity. We describe an index on $s$ with $O(\delta\log\frac{n}{\delta})$ space, measured in $O(\log n)$-bit machine words, that can search in $s$ any string of length $m$ in $O(m + (\mathrm{occ} + 1)\log^{\varepsilon} n)$ time, where $\mathrm{occ}$ is the number of found occurrences and $\varepsilon > 0$ is any fixed constant (the big-O in the space bound hides factor $\frac{1}{\varepsilon}$). Crucially, the index can be built within this space in $O(n\log n)$ expected time by one left-to-right pass on the string $s$ in a streaming fashion. The index does not use the Karp–Rabin fingerprints, and the randomization in the construction time can be eliminated by using deterministic dictionaries instead of hash tables (with a slowdown). The search time matches currently best results and the space is almost optimal (the known optimum is $O(\delta\log\frac{n}{\delta\alpha})$, where $\alpha = \log_{\sigma} n$ and $\sigma$ is the alphabet size, and it coincides with $O(\delta\log\frac{n}{\delta})$ when $\delta = O(n / \alpha^2)$). This is the first index that can be constructed within such space and with such time guarantees. To avoid uninteresting marginal cases, all above bounds are stated for $\delta \ge \Omega(\log\log n)$.
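
The remark that the two space bounds coincide for small $\delta$ can be checked with a short calculation. The following is a sketch under simplifying assumptions that are not stated in the abstract: $\alpha \ge 1$ and $\delta \le n/\alpha^2$ (a constant hidden in the $O$ would only shift the bound by an additive constant). Under these assumptions $\alpha \le \sqrt{n/\delta}$, so

$$\log\frac{n}{\delta} \;\ge\; \log\frac{n}{\delta\alpha} \;=\; \log\frac{n}{\delta} - \log\alpha \;\ge\; \log\frac{n}{\delta} - \frac{1}{2}\log\frac{n}{\delta} \;=\; \frac{1}{2}\log\frac{n}{\delta},$$

and hence $\delta\log\frac{n}{\delta\alpha} = \Theta\left(\delta\log\frac{n}{\delta}\right)$ in this range.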

Paper Structure (13 sections, 17 theorems, 3 figures)

This paper contains 13 sections, 17 theorems, 3 figures.

Key Result

Lemma 1

Given a string $a_0 a_1 \cdots a_m$ over an alphabet $[0\,..\, 2^u)$ such that $a_{i-1} \ne a_{i}$ for any $i \in [1\,..\, m]$, the string $b_1 b_2 \cdots b_{m}$ such that $b_i = \mathsf{vbit}(a_{i-1}, a_{i})$, for $i \in [1\,..\, m]$, satisfies $b_{i-1} \ne b_{i}$, for any $i \in [2\,..\, m]$, and
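
The excerpt above does not define $\mathsf{vbit}$. A minimal sketch of the alphabet-reduction step, assuming the usual Cole–Vishkin-style definition $\mathsf{vbit}(x, y) = 2\ell + a$, where $\ell$ is the index of the lowest bit in which $x$ and $y$ differ and $a$ is bit $\ell$ of $x$, might look as follows. The definition and the helper names `vbit` and `reduce_alphabet` are assumptions made for illustration, not taken from the text.

```python
def vbit(x: int, y: int) -> int:
    """Assumed definition: 2*l + a, where l is the lowest bit index
    at which x and y differ and a is bit l of x."""
    if x == y:
        raise ValueError("vbit is only defined for distinct arguments")
    diff = x ^ y
    # (diff & -diff) isolates the lowest set bit of diff.
    l = (diff & -diff).bit_length() - 1
    return 2 * l + ((x >> l) & 1)


def reduce_alphabet(a: list[int]) -> list[int]:
    """Map a_0 a_1 ... a_m (adjacent letters distinct) to b_1 ... b_m
    with b_i = vbit(a_{i-1}, a_i)."""
    return [vbit(a[i - 1], a[i]) for i in range(1, len(a))]


if __name__ == "__main__":
    a = [5, 3, 8, 2, 7, 1]          # adjacent letters are distinct
    print(reduce_alphabet(a))        # prints [2, 1, 2, 0, 3]
```

With this assumed definition, adjacent output letters differ: if $\mathsf{vbit}(x, y) = \mathsf{vbit}(y, z)$, then both pairs would share the same $\ell$ and $x$, $y$ would agree on bit $\ell$, contradicting the choice of $\ell$. For $u$-bit inputs the output values of the sketch lie in $[0\,..\,2u)$.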

Figures (3)

  • Figure 1: A schematic depiction of the hierarchy of blocks $H$. Colored rectangles with the same color depict blocks with the same $\mathsf{id}$; white rectangles are blocks with any $\mathsf{id}$. From left to right: the hierarchy $H$, $H$ with rules 1–2 applied, $H$ with rule 4 applied, $H$ with rule 3 applied (i.e., it is $\hat{H}$).
  • Figure 2: A schematic depiction of fingerprints of two equal substrings of $s$. The blue rectangles denote blocks that form the fingerprints. The red rectangles denote blocks that intersect the substrings but can be "inconsistent" and are not in the fingerprints.
  • Figure 5: A schematic depiction of the tries $T_\circ$ and $\overset{{}_{\leftarrow}}{T}_\circ$ with the set of points $P$. The subset $P_z \subseteq P$ corresponding to a node $z\in T_\circ$ is depicted as split into chunks $P_z^a$. In each chunk $P_z^a$, its points with maximum and minimum $y$-coordinate are drawn red. The van Emde Boas structure $V_z$ stores the values $\mathsf{v}(y)$ for $y$-coordinates of exactly these red points (thus, $O(|P_z| / \log\log n)$ values in total). The data structure $Q_z$ stores, for each chunk $P_z^a$, the minimum of $p_{x,y}$ for all $(x,y) \in P_z^a$. The elements of each chunk $P_z^a$ are not stored explicitly but can be retrieved in $O(\log n)$ time using one range reporting query on $P$.

Theorems & Definitions (17)

  • Lemma 1: see Cole and Vishkin; Kosolobov and Sivukhin
  • Lemma 1: local consistency
  • Lemma 1: local sparsity
  • Lemma 1
  • Lemma 1
  • Lemma 2
  • Theorem 3
  • Lemma 3
  • Lemma 3: fingerprints
  • Theorem 4
  • ...and 7 more