Random Access in Grammar-Compressed Strings: Optimal Trade-Offs in Almost All Parameter Regimes

Anouk Duyster; Tomasz Kociumaka

TL;DR

This work determines the optimal time-space trade-off for Random Access in grammar-compressed strings as a function of the string length $n$, grammar size $g$, alphabet size $\sigma$, data-structure size $M$, and word size $w$. The authors introduce a grammar-transformation pipeline (contracting, niceness, and leafiness) that preserves expansions, produces shallow parse trees, and enables constant-time local navigation. For any $M$ with $g\log n < Mw < n\log\sigma$, this gives an $O(M)$-space data structure with query time $O\left(\frac{\log \frac{n\log\sigma}{Mw}}{\log \frac{Mw}{g\log n}}\right)$, and they prove a matching unconditional lower bound in almost all regimes. The framework extends from SLGs to Run-Length SLGs (RLSLGs) and supports substring extraction, rank, and select with near-optimal time and space, aided by prefix-sum structures and succinct leaf encodings such as Elias–Fano. The bound interpolates smoothly between the classic $O(\log n)$-time regime, at space close to the grammar size, and $O(1)$-time access, at space close to that of the uncompressed string. The data structure admits efficient deterministic construction.
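As a worked illustration of this interpolation, the bound can be instantiated at three choices of $M$ in the range $g\log n < Mw < n\log\sigma$. The shorthand $t(M)$ is introduced here for convenience and is not notation from the paper; logarithms are base 2, and the additional assumptions $w=\Theta(\log n)$ and $\sigma \le n^{O(1)}$ are made only for this example.

$$t(M) = \frac{\log \frac{n\log\sigma}{Mw}}{\log \frac{Mw}{g\log n}}$$

Near the grammar size, taking $Mw = 2g\log n$ (so $M=\Theta(g)$) makes the denominator equal to 1:

$$t(M) = \log \frac{n\log\sigma}{2g\log n} = O(\log n).$$

With a polylogarithmic blow-up, taking $Mw = g\log^{1+\varepsilon} n$ (so $M=\Theta(g\log^{\varepsilon} n)$) for a constant $\varepsilon>0$:

$$t(M) \le \frac{O(\log n)}{\varepsilon\log\log n} = O\left(\frac{\log n}{\log\log n}\right).$$

Near the uncompressed size, taking $Mw = \frac{1}{2}n\log\sigma$ and assuming $n\log\sigma \ge 4g\log n$:

$$t(M) = \frac{1}{\log \frac{n\log\sigma}{2g\log n}} \le 1 = O(1).$$

The first two cases recover the query times of the earlier $O(g)$-space and $O(g\log^{\varepsilon} n)$-space results cited in the abstract, and the third matches constant-time access to an uncompressed string.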

Abstract

A Random Access query to a string $T\in [0..\sigma)^n$ asks for the character $T[i]$ at a given position $i\in [0..n)$. In $O(n\log\sigma)$ bits of space, this fundamental task admits constant-time queries. While this is optimal in the worst case, much research has focused on compressible strings, hoping for smaller data structures that still admit efficient queries. We investigate the grammar-compressed setting, where $T$ is represented by a straight-line grammar. Our main result is a general trade-off that optimizes Random Access time as a function of string length $n$, grammar size (the total length of productions) $g$, alphabet size $\sigma$, data structure size $M$, and word size $w=\Omega(\log n)$ of the word RAM model. For any $M$ with $g\log n<Mw<n\log\sigma$, we show an $O(M)$-size data structure with query time $O\left(\frac{\log(n\log\sigma\,/\,Mw)}{\log(Mw\,/\,g\log n)}\right)$. Remarkably, we also prove a matching unconditional lower bound that holds for all parameter regimes except very small grammars and relatively small data structures. Previous work focused on query time as a function of $n$ only, achieving $O(\log n)$ time using $O(g)$ space [Bille et al.; SIAM J. Comput. 2015] and $O(\frac{\log n}{\log \log n})$ time using $O(g\log^{\varepsilon} n)$ space for any constant $\varepsilon> 0$ [Belazzougui et al.; ESA'15], [Ganardi, Jeż, Lohrey; J. ACM 2021]. The only tight lower bound [Verbin and Yu; CPM'13] was $\Omega(\frac{\log n}{\log\log n})$ for $w=\Theta(\log n)$, $n^{\Omega(1)}\le g\le n^{1-\Omega(1)}$, and $M=g\log^{\Theta(1)}n$. In contrast, our result yields tight bounds in all relevant parameters and almost all regimes. Our data structure admits efficient deterministic construction. It relies on novel grammar transformations that generalize contracting grammars [Ganardi; ESA'21]. Beyond Random Access, its variants support substring extraction, rank, and select.
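To make the query itself concrete, the following is a minimal Python sketch of the textbook baseline for Random Access in a straight-line grammar: store the expansion length of every nonterminal and walk down the parse tree. It is not the data structure of this paper; its query time is proportional to the parse-tree depth times the production length, which is the cost the paper's grammar transformations are designed to reduce. The dictionary-based grammar representation, the function names, and the example grammar are all assumptions made for illustration.

```python
from typing import Dict, List, Union

# Hypothetical representation (not from the paper): a nonterminal is a str key,
# a terminal is an int in [0, sigma); each production is a list of symbols.
Symbol = Union[str, int]
Grammar = Dict[str, List[Symbol]]


def expansion_lengths(grammar: Grammar, start: str) -> Dict[str, int]:
    """Return the expansion length of every nonterminal reachable from start."""
    lengths: Dict[str, int] = {}

    def length(sym: Symbol) -> int:
        if isinstance(sym, int):  # a terminal expands to one character
            return 1
        if sym not in lengths:  # memoize so each production is summed once
            lengths[sym] = sum(length(s) for s in grammar[sym])
        return lengths[sym]

    length(start)  # recursive; fine for a sketch, not for very deep grammars
    return lengths


def access(grammar: Grammar, lengths: Dict[str, int], start: str, i: int) -> int:
    """Return T[i], where T is the expansion of start (0-based position)."""
    if not 0 <= i < lengths[start]:
        raise IndexError("position out of range")
    sym: Symbol = start
    while isinstance(sym, str):  # descend until a terminal is reached
        for child in grammar[sym]:
            size = 1 if isinstance(child, int) else lengths[child]
            if i < size:  # position i falls inside this child
                sym = child
                break
            i -= size  # skip this child's whole expansion
    return sym


# Example grammar over the alphabet {0, 1}; X4 expands to 0 1 0 0 1 0 1 0.
EXAMPLE: Grammar = {
    "X1": [0, 1],
    "X2": ["X1", 0],
    "X3": ["X2", "X1"],
    "X4": ["X3", "X2"],
}
EXAMPLE_LENGTHS = expansion_lengths(EXAMPLE, "X4")
```

For instance, `access(EXAMPLE, EXAMPLE_LENGTHS, "X4", 4)` descends through `X4`, `X3`, and `X1` to reach the terminal at position 4. The stored lengths take one number per nonterminal, so the sketch stays within space proportional to the grammar size.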

Paper Structure (36 sections, 39 theorems, 44 equations, 3 algorithms)

Introduction
Beyond SLGs.
Beyond Random Access.
Our Techniques
Substring Extraction.
Prefix Aggregation, $\textsf{rank}$, and $\textsf{select}$.
Lower Bound.
Related Work
Open Problems
Preliminaries
Straight-Line Grammars.
Parse Trees.
The Upper Bounds: Overview
Random Access in Weighted Strings and Parse-Tree Viewpoint.
Contracting Grammars.
...and 21 more sections

Key Result

Theorem 1.1

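Let $\mathcal{G}$ be an SLG of size $g$ generating a string $T\in [0.\,. \sigma)^n$. In the word RAM model with word size $w=\Omega(\log n)$, given $\mathcal{G}$ and any value $M$ with $g\log n < Mw < n\log \sigma$, one can in $\mathcal{O}(\frac{Mw}{\log n})$ time construct an $\mathcal{O}(M)$-size data structure that answers Random Access queries on $T$ in $\mathcal{O}\left(\frac{\log \frac{n\log\sigma}{Mw}}{\log \frac{Mw}{g\log n}}\right)$ time.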

Theorems & Definitions (89)

Theorem 1.1
Theorem 1.2
Corollary 1.2: see the appendix
Theorem 1.3
Theorem 1.4
Definition 2.1
Definition 4.1: Certificates [Wang2014]
Remark 4.2
Corollary 4.3
Lemma 4.4 [Verbin and Yu; CPM'13]
...and 79 more
