Random Access in Grammar-Compressed Strings: Optimal Trade-Offs in Almost All Parameter Regimes

Anouk Duyster, Tomasz Kociumaka

TL;DR

This work resolves the optimal trade-off for Random Access in grammar-compressed strings by deriving a general time-space bound in terms of the string length $n$, grammar size $g$, alphabet size $\sigma$, data-structure size $M$, and word size $w$. The authors introduce a grammar-transformation pipeline (contracting, niceness, and leafiness) that produces shallow parse trees and enables constant-time local navigation. For suitable parameter ranges, this gives query time $O\left(\frac{\log \frac{n\log\sigma}{Mw}}{\log \frac{Mw}{g\log n}}\right)$ in $O(M)$ space, and they prove a matching unconditional lower bound in almost all regimes. The framework extends from SLGs to Run-Length SLGs (RLSLGs), preserves expansions, and supports substring extraction, rank, and select with near-optimal time and space, aided by prefix-sum structures and succinct leaf encodings such as Elias–Fano. The approach is constructive: with a large enough space budget it reaches the $O(1)$ access time of uncompressed strings, and it interpolates smoothly between the classic $O(\log n)$ and $O(1)$ regimes as the grammar size and data-structure budget vary. Overall, the results provide tight, parameter-sensitive bounds for grammar-compressed Random Access and related queries, with efficient deterministic construction.
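
To make the interpolation concrete, the bound can be instantiated at three space budgets. This is a back-of-the-envelope sketch rather than a derivation from the paper: it assumes $\log\sigma = O(\log n)$, a constant $\varepsilon > 0$, each chosen $Mw$ lying strictly between $g\log n$ and $n\log\sigma$, and, for the last line, $4g\log n \le n\log\sigma$ so that the denominator is at least $1$.

```latex
\[
t(M) = O\!\left(\frac{\log\frac{n\log\sigma}{Mw}}{\log\frac{Mw}{g\log n}}\right),
\qquad g\log n < Mw < n\log\sigma .
\]
\begin{align*}
Mw = 2g\log n
  &:\quad t = O\!\left(\frac{\log\frac{n\log\sigma}{2g\log n}}{\log 2}\right) \subseteq O(\log n),\\
Mw = g\log^{1+\varepsilon} n
  &:\quad t = O\!\left(\frac{\log n}{\varepsilon\log\log n}\right) = O\!\left(\frac{\log n}{\log\log n}\right),\\
Mw = \tfrac{1}{2}\,n\log\sigma
  &:\quad t = O\!\left(\frac{\log 2}{\log\frac{n\log\sigma}{2g\log n}}\right) \subseteq O(1).
\end{align*}
```

With $w=\Theta(\log n)$, the first two lines correspond to $M=\Theta(g)$ and $M=\Theta(g\log^{\varepsilon} n)$, and the third to a data structure about half the size of the uncompressed string.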

Abstract

A Random Access query to a string $T\in [0..σ)^n$ asks for the character $T[i]$ at a given position $i\in [0..n)$. In $O(n\logσ)$ bits of space, this fundamental task admits constant-time queries. While this is optimal in the worst case, much research has focused on compressible strings, hoping for smaller data structures that still admit efficient queries. We investigate the grammar-compressed setting, where $T$ is represented by a straight-line grammar. Our main result is a general trade-off that optimizes Random Access time as a function of string length $n$, grammar size (the total length of productions) $g$, alphabet size $σ$, data structure size $M$, and word size $w=Ω(\log n)$ of the word RAM model. For any $M$ with $g\log n<Mw<n\logσ$, we show an $O(M)$-size data structure with query time $O(\frac{\log(n\logσ\,/\,Mw)}{\log(Mw\,/\,g\log n)})$. Remarkably, we also prove a matching unconditional lower bound that holds for all parameter regimes except very small grammars and relatively small data structures. Previous work focused on query time as a function of $n$ only, achieving $O(\log n)$ time using $O(g)$ space [Bille et al.; SIAM J. Comput. 2015] and $O(\frac{\log n}{\log \log n})$ time using $O(g\log^ε n)$ space for any constant $ε> 0$ [Belazzougui et al.; ESA'15], [Ganardi, Jeż, Lohrey; J. ACM 2021]. The only tight lower bound [Verbin and Yu; CPM'13] was $Ω(\frac{\log n}{\log\log n})$ for $w=Θ(\log n)$, $n^{Ω(1)}\le g\le n^{1-Ω(1)}$, and $M=g\log^{Θ(1)}n$. In contrast, our result yields tight bounds in all relevant parameters and almost all regimes. Our data structure admits efficient deterministic construction. It relies on novel grammar transformations that generalize contracting grammars [Ganardi; ESA'21]. Beyond Random Access, its variants support substring extraction, rank, and select.
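
For orientation, the Random Access query itself can be illustrated with the textbook baseline: store the expansion length of every symbol and walk down the parse tree from the start symbol. This is a minimal sketch and not the data structure of the paper; its query time is proportional to the parse-tree height, which the paper's transformations are designed to avoid. The grammar encoding (a dict from symbol names to either a terminal character or a pair of symbols), the function names, and the example grammar `FIB_RULES` are all assumptions made for illustration.

```python
from typing import Dict, Tuple, Union

# Hypothetical encoding: a rule is a terminal character or a pair of symbol names.
Rule = Union[str, Tuple[str, str]]

# Illustrative Fibonacci-style grammar; "F6" expands to "abaababa".
FIB_RULES: Dict[str, Rule] = {
    "F1": "b",
    "F2": "a",
    "F3": ("F2", "F1"),
    "F4": ("F3", "F2"),
    "F5": ("F4", "F3"),
    "F6": ("F5", "F4"),
}


def expansion_lengths(rules: Dict[str, Rule], start: str) -> Dict[str, int]:
    """Compute the expansion length of every symbol reachable from `start`."""
    length: Dict[str, int] = {}
    stack = [start]
    while stack:  # explicit stack, so deep grammars do not hit the recursion limit
        x = stack[-1]
        if x in length:
            stack.pop()
            continue
        rule = rules[x]
        if isinstance(rule, str):  # terminal: expands to one character
            length[x] = 1
            stack.pop()
            continue
        pending = [child for child in rule if child not in length]
        if pending:
            stack.extend(pending)  # resolve children first, then revisit x
        else:
            length[x] = length[rule[0]] + length[rule[1]]
            stack.pop()
    return length


def access(rules: Dict[str, Rule], length: Dict[str, int], start: str, i: int) -> str:
    """Return T[i] by descending the parse tree; time is O(height)."""
    if not 0 <= i < length[start]:
        raise IndexError(i)
    x = start
    while True:
        rule = rules[x]
        if isinstance(rule, str):
            return rule
        left, right = rule
        if i < length[left]:
            x = left  # position i falls inside the left child's expansion
        else:
            i -= length[left]  # skip the left expansion and continue on the right
            x = right
```

Calling `expansion_lengths(FIB_RULES, "F6")` once and then `access(FIB_RULES, lengths, "F6", i)` for each position reconstructs the string one character at a time. On an unbalanced grammar this walk can take time linear in the number of rules, which is why the earlier $O(\log n)$-time results and the trade-off above need additional machinery.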

Paper Structure

This paper contains 36 sections, 39 theorems, 44 equations, and 3 algorithms.

Key Result

Theorem 1.1

Let $\mathcal{G}$ be an SLG of size $g$ generating a string $T\in [0.\,. \sigma)^n$. In the word RAM model with word size $w=\Omega(\log n)$, given $\mathcal{G}$ and any value $M$ with $g\log n < Mw < n\log \sigma$, one can in $\mathcal{O}(\frac{Mw}{\log n})$ time construct an $\mathcal{O}(M)$-size

Theorems & Definitions (89)

  • Theorem 1.1
  • Theorem 1.2
  • Corollary 1.2: see the appendix
  • Theorem 1.3
  • Theorem 1.4
  • Definition 2.1
  • Definition 4.1: Certificates [Wang2014]
  • Remark 4.2
  • Corollary 4.3
  • Lemma 4.4: [VY13]
  • ...and 79 more