Random Access in Grammar-Compressed Strings: Optimal Trade-Offs in Almost All Parameter Regimes
Anouk Duyster, Tomasz Kociumaka
TL;DR
This work resolves the optimal time-space trade-off for Random Access in grammar-compressed strings, expressed in terms of the string length $n$, grammar size $g$, alphabet size $\sigma$, data-structure size $M$, and word size $w$. The authors introduce a novel grammar-transformation pipeline (contracting, niceness, and leafiness) that produces shallow parse trees and enables constant-time local navigation. For suitable parameter ranges, this yields an $O(M)$-space data structure with query time $O\left(\frac{\log \frac{n\log\sigma}{Mw}}{\log \frac{Mw}{g\log n}}\right)$, and they prove a matching unconditional lower bound in almost all regimes. The framework extends beyond straight-line grammars (SLGs) to Run-Length SLGs (RLSLGs), preserves expansions, and supports substring extraction, rank, and select with near-optimal time and space, aided by prefix-sum structures and succinct leaf encodings such as Elias–Fano. The construction interpolates smoothly between the classic $O(\log n)$-time regime and the $O(1)$-time regime reached with near-uncompressed space, as the grammar size and data-structure budget vary. Overall, the results give tight, parameter-sensitive bounds for grammar-compressed Random Access and related queries, with efficient deterministic construction and applicability to broader grammar-compressed representations.
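As a rough sanity check of how this bound interpolates (a worked reading of the formula, not a statement taken from the paper), write $A=\frac{n\log\sigma}{Mw}$ and $B=\frac{Mw}{g\log n}$. Both are shorthand introduced here, and both exceed $1$ whenever $g\log n<Mw<n\log\sigma$. The query time is then $O(\log A/\log B)$, and:

$$
AB=\frac{n\log\sigma}{g\log n},
\qquad
\frac{\log A}{\log B}\le 1
\iff A\le B
\iff (Mw)^2\ge (n\log\sigma)\,(g\log n).
$$

So the bound becomes constant once $Mw$ reaches the geometric mean of $g\log n$ and $n\log\sigma$. At the other extreme, for $Mw=2g\log n$ (that is, $B=2$), it reads $O\left(\log\frac{n\log\sigma}{g\log n}\right)$, which is $O(\log n)$ assuming $\sigma\le n^{O(1)}$.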
Abstract
A Random Access query to a string $T\in [0..σ)^n$ asks for the character $T[i]$ at a given position $i\in [0..n)$. In $O(n\logσ)$ bits of space, this fundamental task admits constant-time queries. While this is optimal in the worst case, much research has focused on compressible strings, hoping for smaller data structures that still admit efficient queries. We investigate the grammar-compressed setting, where $T$ is represented by a straight-line grammar. Our main result is a general trade-off that optimizes Random Access time as a function of string length $n$, grammar size (the total length of productions) $g$, alphabet size $σ$, data structure size $M$, and word size $w=Ω(\log n)$ of the word RAM model. For any $M$ with $g\log n<Mw<n\logσ$, we show an $O(M)$-size data structure with query time $O(\frac{\log(n\logσ\,/\,Mw)}{\log(Mw\,/\,g\log n)})$. Remarkably, we also prove a matching unconditional lower bound that holds for all parameter regimes except very small grammars and relatively small data structures. Previous work focused on query time as a function of $n$ only, achieving $O(\log n)$ time using $O(g)$ space [Bille et al.; SIAM J. Comput. 2015] and $O(\frac{\log n}{\log \log n})$ time using $O(g\log^ε n)$ space for any constant $ε> 0$ [Belazzougui et al.; ESA'15], [Ganardi, Jeż, Lohrey; J. ACM 2021]. The only tight lower bound [Verbin and Yu; CPM'13] was $Ω(\frac{\log n}{\log\log n})$ for $w=Θ(\log n)$, $n^{Ω(1)}\le g\le n^{1-Ω(1)}$, and $M=g\log^{Θ(1)}n$. In contrast, our result yields tight bounds in all relevant parameters and almost all regimes. Our data structure admits efficient deterministic construction. It relies on novel grammar transformations that generalize contracting grammars [Ganardi; ESA'21]. Beyond Random Access, its variants support substring extraction, rank, and select.
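To make the query concrete, here is a minimal Python sketch of the textbook baseline for Random Access on a straight-line grammar: store the expansion length of every nonterminal and walk down the parse tree. This is not the data structure from the paper; its query time is proportional to the parse-tree height (times the rule length), with none of the balancing or transformations described above. The toy grammar `RULES` and the helper names `expansion_lengths`, `random_access`, and `expand` are hypothetical, chosen only for illustration.

```python
# Hypothetical toy grammar: each nonterminal maps to a tuple of symbols.
# Any symbol that is not a key of RULES is treated as a terminal character.
RULES = {
    "S": ("A", "B", "A"),
    "A": ("B", "c"),
    "B": ("a", "b"),
}


def expansion_lengths(rules):
    """Return {nonterminal: length of its expansion}, computed with memoization."""
    lengths = {}

    def length(sym):
        if sym not in rules:          # terminal: expands to one character
            return 1
        if sym not in lengths:
            lengths[sym] = sum(length(s) for s in rules[sym])
        return lengths[sym]

    for nonterminal in rules:
        length(nonterminal)
    return lengths


def random_access(rules, lengths, start, i):
    """Return T[i], where T is the expansion of `start`, without expanding T."""
    if not 0 <= i < lengths[start]:
        raise IndexError(i)
    sym = start
    while sym in rules:
        # Find the child whose expansion covers position i, then descend into it.
        for child in rules[sym]:
            child_len = lengths.get(child, 1)
            if i < child_len:
                sym = child
                break
            i -= child_len
    return sym


def expand(rules, sym):
    """Fully expand `sym` (used here only to check the sketch)."""
    if sym not in rules:
        return sym
    return "".join(expand(rules, s) for s in rules[sym])


LENGTHS = expansion_lengths(RULES)
TEXT = expand(RULES, "S")
```

The stored lengths take one machine word per nonterminal, so this baseline fits in $O(g)$ space. The trade-off in the abstract concerns how much faster than this plain descent a query can be for a given space budget $M$.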
