Space-efficient SLP encoding for $O(\log N)$-time random access

Akito Takasaka, Tomohiro I

TL;DR

The paper addresses random access on grammar-compressed strings represented by Straight-Line Programs (SLPs). It introduces a space-efficient encoding framework that combines the symmetric centroid decomposition of the SLP DAG into SC-paths with compacted binary tries that support interval-biased searches, so that a substring $T[p..q]$ can be extracted in worst-case $O(\log N + q - p)$ time. Here $N$ is the length of the text, $\sigma$ is the alphabet size, $n$ is the number of variables of the SLP, and $n' \le n$ is the number of SC-paths. Three encodings with distinct space bounds are given: (I) $n \lceil \lg N \rceil + (n + n') \lceil \lg (n+\sigma) \rceil + 4n - 2n' + o(n)$ bits, (II) $n \lceil \lg N \rceil + n \lceil \lg (n+\sigma) \rceil + 5n + n' + o(n)$ bits, and (III) $n \lceil \lg N \rceil + n \lceil \lg (n+\sigma) \rceil + 5n - n' + \sigma + o(n+\sigma)$ bits. The query time is almost optimal: by the lower bound of Verbin and Yu [CPM 2013], the $O(\log N)$ term cannot be significantly improved in general with $\mathrm{poly}(n)$-space data structures. The contribution is a set of explicit encodings with bit-level space bounds and worst-case time guarantees for substring extraction on compressed strings.
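
To make the random access problem concrete, the following minimal sketch answers queries by walking the derivation tree with precomputed expansion lengths. It is a naive baseline and not the paper's encoding: each `access` costs time proportional to the height of the grammar rather than $O(\log N)$, and `extract` simply repeats `access`. The dictionary-based rule format, the 0-based inclusive indexing, and all names (`RULES`, `expansion_lengths`, `access`, `extract`) are assumptions made for illustration.

```python
# Hypothetical SLP format: a variable maps either to one terminal character
# or to a pair (left, right) of variable names.
RULES = {
    "A": "a",
    "B": "b",
    "X1": ("A", "B"),    # derives "ab"
    "X2": ("X1", "A"),   # derives "aba"
    "X3": ("X2", "X1"),  # derives "abaab"
    "X4": ("X3", "X2"),  # derives "abaababa"
}


def expansion_lengths(rules):
    """Return the length of the string derived by every variable."""
    length = {}
    stack = list(rules)
    while stack:
        v = stack[-1]
        if v in length:
            stack.pop()
            continue
        rhs = rules[v]
        if isinstance(rhs, str):
            length[v] = 1
            stack.pop()
            continue
        pending = [c for c in rhs if c not in length]
        if pending:
            # Resolve the children first, then revisit v.
            stack.extend(pending)
        else:
            length[v] = length[rhs[0]] + length[rhs[1]]
            stack.pop()
    return length


def access(rules, length, root, p):
    """Return T[p] (0-based) by following one root-to-leaf path."""
    v = root
    while not isinstance(rules[v], str):
        left, right = rules[v]
        if p < length[left]:
            v = left
        else:
            p -= length[left]
            v = right
    return rules[v]


def extract(rules, length, root, p, q):
    """Return T[p..q] (0-based, inclusive) by one access per position."""
    return "".join(access(rules, length, root, i) for i in range(p, q + 1))
```

With the example grammar, `extract(RULES, expansion_lengths(RULES), "X4", 2, 5)` returns `"aaba"`. The encodings summarized here aim to replace the height-dependent walk with an $O(\log N)$-time descent while storing the grammar compactly.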

Abstract

A Straight-Line Program (SLP) $G$ for a string $T$ is a context-free grammar (CFG) that derives only $T$, and it can be regarded as a compressed representation of $T$. In this paper, we show how to encode $G$ in $n \lceil \lg N \rceil + (n + n') \lceil \lg (n+\sigma) \rceil + 4n - 2n' + o(n)$ bits to support random access queries that extract $T[p..q]$ in worst-case $O(\log N + q - p)$ time, where $N$ is the length of $T$, $\sigma$ is the alphabet size, $n$ is the number of variables in $G$, and $n' \le n$ is the number of symmetric centroid paths in the DAG representation of $G$. The time complexity is almost optimal because Verbin and Yu [CPM 2013] proved that the $O(\log N)$ term cannot be significantly improved in general with $\mathrm{poly}(n)$-space data structures. We also present alternative encodings that achieve the same random access time with $n \lceil \lg N \rceil + n \lceil \lg (n+\sigma) \rceil + 5n + n' + o(n)$ or $n \lceil \lg N \rceil + n \lceil \lg (n+\sigma) \rceil + 5n - n' + \sigma + o(n+\sigma)$ bits of space.
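
The symmetric centroid paths counted by $n'$ can be illustrated with a short sketch. It assumes the usual definition of the symmetric centroid decomposition due to Ganardi, Jeż, and Lohrey: an edge $(u, v)$ of the DAG is an SC-edge when $\lfloor \lg n_s(u) \rfloor = \lfloor \lg n_s(v) \rfloor$ and $\lfloor \lg n_e(u) \rfloor = \lfloor \lg n_e(v) \rfloor$, where $n_s$ counts paths from the root and $n_e$ counts paths to the leaves. The rule format, the treatment of terminal rules as leaves, and the names `RULES` and `sc_edges` are illustrative assumptions; the paper's own conventions may differ.

```python
# Hypothetical SLP format: a variable maps either to one terminal character
# or to a pair (left, right) of variable names.
RULES = {
    "A": "a",
    "B": "b",
    "X1": ("A", "B"),
    "X2": ("X1", "A"),
    "X3": ("X2", "X1"),
    "X4": ("X3", "X2"),  # derives "abaababa"
}


def sc_edges(rules, root):
    """Return the SC-edges of the DAG reachable from root (assumed definition)."""
    # Iterative DFS producing a post-order: children appear before parents.
    post, seen, stack = [], set(), [(root, False)]
    while stack:
        v, finished = stack.pop()
        if finished:
            post.append(v)
            continue
        if v in seen:
            continue
        seen.add(v)
        stack.append((v, True))
        if not isinstance(rules[v], str):
            for c in rules[v]:
                if c not in seen:
                    stack.append((c, False))

    # n_e[v]: number of paths from v to a leaf (equals the expansion length).
    n_e = {}
    for v in post:
        rhs = rules[v]
        n_e[v] = 1 if isinstance(rhs, str) else n_e[rhs[0]] + n_e[rhs[1]]

    # n_s[v]: number of paths from the root to v; parents are processed first.
    n_s = {v: 0 for v in post}
    n_s[root] = 1
    for v in reversed(post):
        rhs = rules[v]
        if not isinstance(rhs, str):
            for c in rhs:
                n_s[c] += n_s[v]

    def floor_lg(x):
        # Exact floor(log2(x)) for a positive integer x.
        return x.bit_length() - 1

    edges = []
    for v in post:
        rhs = rules[v]
        if isinstance(rhs, str):
            continue
        for c in rhs:
            if floor_lg(n_s[v]) == floor_lg(n_s[c]) and floor_lg(n_e[v]) == floor_lg(n_e[c]):
                edges.append((v, c))
    return edges
```

Under this definition every node has at most one outgoing and at most one incoming SC-edge, so the SC-edges form disjoint paths. In the example grammar the only SC-edge is `("X2", "X1")`.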

Paper Structure

This paper contains 13 sections, 4 theorems, 2 equations, 3 figures.

Key Result

Theorem 1

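Let $\mathcal{T}$ be a string of length $N$ over an alphabet of size $\sigma$. An SLP $\mathcal{G}$ for $\mathcal{T}$ can be encoded in (I) $n \lceil \lg N \rceil + (n + n') \lceil \lg (n+\sigma) \rceil + 4n - 2n' + o(n)$, (II) $n \lceil \lg N \rceil + n \lceil \lg (n+\sigma) \rceil + 5n + n' + o(n)$, or (III) $n \lceil \lg N \rceil + n \lceil \lg (n+\sigma) \rceil + 5n - n' + \sigma + o(n+\sigma)$ bits of space while supporting random access queries that extract $\mathcal{T}[p..q]$ in worst-case $O(\log N + q - p)$ time, where $n$ is the number of variables of $\mathcal{G}$ and $n'$ is the number of symmetric centroid paths in the DAG representation of $\mathcal{G}$.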
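
The three space bounds are easier to compare with concrete numbers. The sketch below evaluates only their explicit terms and ignores the $o(n)$ and $o(n+\sigma)$ terms, so the results are indicative rather than exact sizes. The function name `space_bounds` and the sample parameter values are assumptions chosen for illustration, not figures from the paper.

```python
def ceil_lg(x):
    """Exact ceil(log2(x)) for a positive integer x."""
    return (x - 1).bit_length()


def space_bounds(N, n, n_prime, sigma):
    """Explicit terms (in bits) of encodings (I), (II), (III); lower-order terms omitted."""
    lg_N = ceil_lg(N)
    lg_sym = ceil_lg(n + sigma)
    return {
        "I": n * lg_N + (n + n_prime) * lg_sym + 4 * n - 2 * n_prime,
        "II": n * lg_N + n * lg_sym + 5 * n + n_prime,
        "III": n * lg_N + n * lg_sym + 5 * n - n_prime + sigma,
    }


# Hypothetical parameters: a text of length 2^20 with 1000 variables,
# 300 SC-paths, and a 4-letter alphabet.
print(space_bounds(N=2**20, n=1000, n_prime=300, sigma=4))
```

For these sample values the explicit terms are 36400, 35300, and 34704 bits for (I), (II), and (III) respectively. Which encoding is smallest in general depends on how $n'$ and $\sigma$ compare with $n$.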

Figures (3)

  • Figure 1: Illustration for the high-level strategy to achieve $O(\log N)$-time random access. The path from the root to the target leaf $x_{e+1}^{\mathsf{in}} = \mathcal{T}[p]$ contains $e~(\le 2 \lg N)$ non-SC-edges $(x_1^{\mathsf{out}}, x_2^{\mathsf{in}})$, $(x_2^{\mathsf{out}}, x_3^{\mathsf{in}})$, $\dots$ and $(x_{e}^{\mathsf{out}}, x_{e+1}^{\mathsf{in}})$ depicted by dashed arrows. The components connected by plain arrows are SC-paths. Our sub-goal is to move from $x_{i}^{\mathsf{in}}$ to $x_{i+1}^{\mathsf{in}}$ efficiently in $O(1 + \log |\langle x_{i}^{\mathsf{in}}\rangle| - \log |\langle x_{i+1}^{\mathsf{in}}\rangle|)$ time.
  • Figure 2: Illustration for our encoding (I). Supposing that the $r$-th SC-path has $6$ nodes $(u_1, u_2, u_3, u_4, u_5, u_6)$ in the form depicted above, the layout of the information for this SC-path in $\mathsf{P}$, $\mathsf{D}$, $\mathsf{R}_{1}$, $\mathsf{R}_{2}$, $\mathsf{G}$ and $\mathsf{B}$ is shown below.
  • Figure 3: Illustration for our encoding (II). Supposing that the $r$-th SC-path has $6$ nodes $(u_1, u_2, u_3, u_4, u_5, u_6)$ in the form depicted left above and has three child SC-paths on $\mathsf{T_{E}}$ starting with $v_2, v_4$ and $v_7$, the layout of the information for this SC-path in $\mathsf{P}$, $\mathsf{D}$, $\mathsf{M_{E}}$ and $\mathsf{R_{E}}$ is shown below.
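
Summing the per-step cost from the caption of Figure 1 over the $e$ non-SC-edges shows where the $O(\log N)$ bound comes from. The derivation below is a sketch under three assumptions: $\langle x \rangle$ denotes the string derived from node $x$, the first node is the root so that $|\langle x_1^{\mathsf{in}} \rangle| = N$, and the last node is a leaf so that $|\langle x_{e+1}^{\mathsf{in}} \rangle| = 1$.

```latex
\begin{align*}
\sum_{i=1}^{e} O\bigl(1 + \log |\langle x_{i}^{\mathsf{in}} \rangle|
                        - \log |\langle x_{i+1}^{\mathsf{in}} \rangle|\bigr)
  &= O\bigl(e + \log |\langle x_{1}^{\mathsf{in}} \rangle|
              - \log |\langle x_{e+1}^{\mathsf{in}} \rangle|\bigr)
     && \text{(the logarithmic terms telescope)} \\
  &= O(2 \lg N + \log N - \log 1)
     && \text{(using } e \le 2 \lg N \text{)} \\
  &= O(\log N).
\end{align*}
```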

Theorems & Definitions (8)

  • Theorem 1
  • Lemma 2 (Raman, Raman, and Satti, 2007)
  • Lemma 3
  • proof
  • Lemma 5
  • proof
  • proof
  • proof