Generalized Straight-Line Programs

Gonzalo Navarro; Francisco Olivares; Cristian Urbina

TL;DR

The paper extends grammar-balancing techniques from SLPs to Generalized SLPs (GSLPs) and specializes them to Iterated SLPs (ISLPs) and Run-Length SLPs (RLSLPs). It proves that a GSLP can be balanced to height O(log n) while keeping the same asymptotic size, and shows that ISLPs can break the substring-complexity measure δ on certain text families while still supporting direct access and various substring queries in polylogarithmic time. The work also develops data structures for navigating ISLPs that support direct access, substring extraction, and queries such as range minima (RMQ) and next/previous smaller value (NSV/PSV), with the best time/space trade-offs obtained for low-d ISLPs and for RLSLPs. A key application is efficient Karp-Rabin fingerprint computation on RLSLP-compressed texts, achieved via a general framework for composable functions. Overall, the results advance grammar-based compressed indexing for highly repetitive texts.
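The composability idea behind the fingerprint application can be illustrated with a minimal sketch. It relies only on the standard Karp-Rabin identity $\kappa(XY) = \kappa(X)\cdot c^{|Y|} + \kappa(Y) \bmod p$ and on repeated squaring for a run $B^t$; it is not the paper's data structure. The modulus `P`, the base `C`, and all function names are assumptions chosen for illustration.

```python
# Assumed parameters: any large prime modulus and any base below it would do.
P = (1 << 61) - 1
C = 911382323

# A summary is a pair (fingerprint, C^length mod P); EMPTY summarizes "".
EMPTY = (0, 1)

def leaf(ch):
    """Summary of a single character."""
    return (ord(ch) % P, C % P)

def combine(x, y):
    """Summary of X.Y obtained from the summaries of X and Y alone."""
    fx, px = x
    fy, py = y
    return ((fx * py + fy) % P, (px * py) % P)

def power(x, t):
    """Summary of X^t using O(log t) combine steps (repeated squaring)."""
    result = EMPTY
    while t > 0:
        if t & 1:
            result = combine(result, x)
        x = combine(x, x)
        t >>= 1
    return result

def fingerprint(s):
    """Reference summary computed character by character."""
    acc = EMPTY
    for ch in s:
        acc = combine(acc, leaf(ch))
    return acc

unit = fingerprint("01")
print(power(unit, 6) == fingerprint("01" * 6))  # True
```

Because `combine` is associative, the summary of a run-length rule $A \rightarrow B^t$ costs O(log t) combinations once the summary of $B$ is known, instead of t of them.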

Abstract

It was recently proved that any Straight-Line Program (SLP) generating a given string can be transformed in linear time into an equivalent balanced SLP of the same asymptotic size. We generalize this proof to a general class of grammars we call Generalized SLPs (GSLPs), which allow rules of the form $A \rightarrow x$ where $x$ is any Turing-complete representation (of size $|x|$) of a sequence of symbols (potentially much longer than $|x|$). We then specialize GSLPs to so-called Iterated SLPs (ISLPs), which allow rules of the form $A \rightarrow \prod_{i=k_1}^{k_2} B_1^{i^{c_1}}\cdots B_t^{i^{c_t}}$ of size $2t+2$. We prove that ISLPs break, for some text families, the measure $\delta$ based on substring complexity, a lower bound for most measures and compressors exploiting repetitiveness. Further, ISLPs can extract any substring of length $\lambda$, from the represented text $T[1..n]$, in time $O(\lambda + \log^2 n \log\log n)$. This is the first compressed representation for repetitive texts breaking $\delta$ while, at the same time, supporting direct access to arbitrary text symbols in polylogarithmic time. We also show how to compute some substring queries, like range minima and next/previous smaller value, in time $O(\log^2 n \log\log n)$. Finally, we further specialize the grammars to Run-Length SLPs (RLSLPs), which restrict the rules allowed by ISLPs to the form $A \rightarrow B^t$. Apart from inheriting all the previous results with the term $\log^2 n \log\log n$ reduced to the near-optimal $\log n$, we show that RLSLPs can exploit balance to efficiently compute a wide class of substring queries we call "composable", i.e., $f(X \cdot Y)$ can be obtained from $f(X)$ and $f(Y)$...
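To make the ISLP rule form concrete, the following sketch expands a toy grammar by brute force. The dictionary encoding, the tags `"cat"` and `"iter"`, and the function name `expand` are assumptions made for this illustration, and the sketch assumes $1 \le k_1 \le k_2$. It materializes the whole string, so it shows only what a rule generates, not how the compressed representation is queried.

```python
def expand(grammar, symbol):
    """Return the string generated by `symbol` (brute-force expansion)."""
    rule = grammar[symbol]
    if isinstance(rule, str):            # terminal rule A -> a
        return rule
    kind = rule[0]
    if kind == "cat":                    # classic SLP rule A -> B C
        return "".join(expand(grammar, s) for s in rule[1:])
    if kind == "iter":                   # A -> prod_{i=k1}^{k2} B_1^{i^c_1} ... B_t^{i^c_t}
        _, k1, k2, terms = rule
        pieces = []
        for i in range(k1, k2 + 1):
            for sym, c in terms:
                pieces.append(expand(grammar, sym) * (i ** c))
        return "".join(pieces)
    raise ValueError(f"unknown rule kind: {kind!r}")

grammar = {
    "A": "a",
    "B": "b",
    # S -> prod_{i=1}^{4} A^{i} B^{1}: t = 2 terms, hence size 2t + 2 = 6
    "S": ("iter", 1, 4, [("A", 1), ("B", 0)]),
    # A run-length rule R -> B^3 written as an iterated rule with exponent i^0
    "R": ("iter", 1, 3, [("B", 0)]),
}

print(expand(grammar, "S"))  # abaabaaabaaaab
print(expand(grammar, "R"))  # bbb
```

The rule for `S` stores six numbers and symbols yet produces a string whose length grows quadratically in $k_2$, which hints at why such rules can describe some texts more succinctly than plain concatenation rules.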

Paper Structure (26 sections, 23 theorems, 19 equations, 2 figures, 6 algorithms)

Introduction
Preliminaries
Strings
Straight-Line Programs
Other Repetitiveness Measures
Burrows-Wheeler Transform
Lempel-Ziv Parsing
Bidirectional Macro Schemes
String Attractors
Substring Complexity
L-systems
Generalized SLPs and How to Balance Them
Iterated Straight-Line Programs
Accessing ISLPs
Data Structures
...and 11 more sections

Key Result

Lemma 1

(Ganardi et al., GJL2021) Let $D = (V, E)$ be a DAG. Then every node has at most one outgoing and at most one incoming edge from $E_{scd}(D)$. Furthermore, every path from the root $r$ to a sink node contains at most $2\log_2 n(D)$ edges that do not belong to $E_{scd}(D)$.
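The lemma can be checked on small inputs with the sketch below. The excerpt does not define $E_{scd}(D)$ or $n(D)$, so the code assumes the symmetric centroid decomposition as recalled from Ganardi et al.: an edge $(u, v)$ belongs to $E_{scd}(D)$ when $u$ and $v$ agree on both $\lfloor \log_2 \rfloor$ of the number of root-to-node paths and $\lfloor \log_2 \rfloor$ of the number of node-to-sink paths, and $n(D)$ is the number of root-to-sink paths. The function name, the edge-list encoding, and the toy DAG are hypothetical; every node is assumed reachable from a root with no incoming edges.

```python
from functools import lru_cache

def sc_decomposition(edges, root):
    """Return (E_scd, n(D)) for a DAG given as a list of (u, v) edges."""
    nodes = {root} | {x for e in edges for x in e}
    out = {v: [] for v in nodes}
    inc = {v: [] for v in nodes}
    for u, v in edges:
        out[u].append(v)
        inc[v].append(u)

    @lru_cache(maxsize=None)
    def from_root(v):
        # Number of paths from the root to v.
        return 1 if v == root else sum(from_root(u) for u in inc[v])

    @lru_cache(maxsize=None)
    def to_sinks(v):
        # Number of paths from v to sink nodes.
        return 1 if not out[v] else sum(to_sinks(w) for w in out[v])

    def label(v):
        # (floor(log2 from_root), floor(log2 to_sinks)) via bit lengths.
        return (from_root(v).bit_length() - 1, to_sinks(v).bit_length() - 1)

    scd = [(u, v) for u, v in edges if label(u) == label(v)]
    return scd, to_sinks(root)

# DAG of an unbalanced toy grammar: X3 -> X2 a, X2 -> X1 a, X1 -> a b
edges = [("X3", "X2"), ("X3", "a"), ("X2", "X1"),
         ("X2", "a"), ("X1", "a"), ("X1", "b")]
scd, n_paths = sc_decomposition(edges, "X3")
print(scd)      # [('X2', 'X1')]
print(n_paths)  # 4
```

On this DAG the longest root-to-sink path, X3, X2, X1, a, uses two edges outside $E_{scd}(D)$, within the bound $2\log_2 4 = 4$.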

Figures (2)

Figure 1: The DAG and SC-decomposition of an unfolded RLSLP generating the string $\texttt{0}(\texttt{0}(\texttt{0}\texttt{1})^6\texttt{1}^2)^6(\texttt{0}\texttt{1})^5\texttt{1}^3$. The value to the left of a node is the number of paths from the root to that node, and the value to the right is the number of paths from the node to sink nodes. Red edges belong to the SC-decomposition of the DAG. Blue (resp. green) edges branch from an SC-path to the left (resp. to the right).
Figure 2: Data structures built for the ISLP rule $A \rightarrow \prod_{i=1}^5B^{i}C^{i^2}D^{i}EEE^iB^{i^2}C^{i^3}D$, with $|\mathtt{exp}(B)| = 2$, $|\mathtt{exp}(C)| = 3$, $|\mathtt{exp}(D)| = 4$, and $|\mathtt{exp}(E)| = 7$. We show some of the polynomials to be simulated with these data structures.
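The role of polynomials in Figure 2 can be illustrated with a minimal sketch that maps a position of $\mathtt{exp}(A)$ to the iteration and term containing it. It reads `EEE^i` in the caption as the three terms $E$, $E$, $E^i$, and it uses closed-form power sums to evaluate prefix lengths without expanding anything. The names `power_sum`, `prefix_len`, and `locate` are assumptions, and the linear scan inside an iteration is a simplification, not the paper's data structure.

```python
# Expansion lengths and rule terms (symbol, exponent c of i) from the caption.
LEN = {"B": 2, "C": 3, "D": 4, "E": 7}
TERMS = [("B", 1), ("C", 2), ("D", 1), ("E", 0), ("E", 0),
         ("E", 1), ("B", 2), ("C", 3), ("D", 0)]
K1, K2 = 1, 5

def power_sum(c, m):
    """Closed form of 1^c + 2^c + ... + m^c for c in 0..3."""
    if c == 0:
        return m
    if c == 1:
        return m * (m + 1) // 2
    if c == 2:
        return m * (m + 1) * (2 * m + 1) // 6
    if c == 3:
        return (m * (m + 1) // 2) ** 2
    raise ValueError("only exponents 0..3 are handled in this sketch")

def prefix_len(i):
    """Total length generated by iterations K1..i (0 when i < K1)."""
    return sum(LEN[s] * (power_sum(c, i) - power_sum(c, K1 - 1))
               for s, c in TERMS)

def locate(p):
    """Map 1-based position p to (iteration, term index, offset in that run)."""
    if not 1 <= p <= prefix_len(K2):
        raise IndexError(p)
    lo, hi = K1, K2
    while lo < hi:                      # binary search for the iteration
        mid = (lo + hi) // 2
        if prefix_len(mid) >= p:
            hi = mid
        else:
            lo = mid + 1
    i = lo
    rest = p - prefix_len(i - 1) - 1    # 0-based offset inside iteration i
    for j, (s, c) in enumerate(TERMS):  # linear scan over the terms
        run = LEN[s] * i ** c
        if rest < run:
            return i, j, rest
        rest -= run

print(prefix_len(K2))  # 1235
print(locate(40))      # (2, 0, 0)
```

Iteration $i$ contributes $3i^3 + 5i^2 + 13i + 18$ symbols, so $|\mathtt{exp}(A)| = 1235$, and position 40 is the first symbol of the run $B^2$ that opens iteration 2.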

Theorems & Definitions (44)

Definition 1
Definition 2
Lemma 1
Definition 3
Lemma 2
Theorem 3
proof
Definition 4
Definition 5
Proposition 4
...and 34 more
