Generalized Straight-Line Programs
Gonzalo Navarro, Francisco Olivares, Cristian Urbina
TL;DR
The paper extends grammar-balancing techniques from SLPs to Generalized SLPs (GSLPs) and studies two specializations of them, Iterated SLPs (ISLPs) and Run-Length SLPs (RLSLPs). It proves that any GSLP can be balanced to height O(log n) while keeping the same asymptotic size, and shows that ISLPs can break the substring-complexity lower bound δ on certain text families while still supporting direct access and various substring queries in polylogarithmic time. The work also develops data structures for navigating ISLPs, enabling direct access, substring extraction, and queries such as RMQ and NSV/PSV, with near-optimal time/space trade-offs, especially when restricting to low-d ISLPs or RLSLPs. A key application is efficient Karp-Rabin fingerprint computation on RLSLP-compressed texts, achieved via a general framework for composable functions. Overall, the results extend grammar-based compressed indexing for highly repetitive texts.
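To illustrate what "composable" means for Karp-Rabin fingerprints, here is a minimal Python sketch. It is not the paper's data structure: the modulus `P`, the base `C`, and the function names `leaf`, `concat`, `power`, and `fingerprint_plain` are hypothetical choices made for this example. Each string is summarized by a triple (fingerprint, `C` raised to its length, length), so the summary of a concatenation follows from the summaries of its parts, and the summary of a run B^t follows from that of B in O(log t) combinations by repeated squaring.

```python
# Hypothetical parameters chosen for illustration only.
P = (1 << 61) - 1   # modulus (a Mersenne prime)
C = 256             # base of the polynomial hash

# Summary of the empty string: (fingerprint, C**length mod P, length).
EMPTY = (0, 1, 0)


def leaf(ch):
    """Summary of a single-character string."""
    return (ord(ch) % P, C % P, 1)


def concat(x, y):
    """Summary of X.Y computed only from the summaries of X and Y."""
    hx, px, lx = x
    hy, py, ly = y
    return ((hx * py + hy) % P, (px * py) % P, lx + ly)


def power(x, t):
    """Summary of X repeated t times, using O(log t) calls to concat."""
    result = EMPTY
    while t > 0:
        if t & 1:
            result = concat(result, x)
        x = concat(x, x)   # square: summary of the current block doubled
        t >>= 1
    return result


def fingerprint_plain(text):
    """Reference summary computed by scanning the uncompressed text."""
    h = 0
    for ch in text:
        h = (h * C + ord(ch)) % P
    return (h, pow(C, len(text), P), len(text))
```

For a run-length rule of the form A → B^t, `power(summary_of_B, t)` gives the summary of A without expanding the run, which is the property a compressed structure would rely on.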
Abstract
It was recently proved that any Straight-Line Program (SLP) generating a given string can be transformed in linear time into an equivalent balanced SLP of the same asymptotic size. We generalize this proof to a general class of grammars we call Generalized SLPs (GSLPs), which allow rules of the form $A \rightarrow x$ where $x$ is any Turing-complete representation (of size $|x|$) of a sequence of symbols (potentially much longer than $|x|$). We then specialize GSLPs to so-called Iterated SLPs (ISLPs), which allow rules of the form $A \rightarrow \prod_{i=k_1}^{k_2} B_1^{i^{c_1}} \cdots B_t^{i^{c_t}}$ of size $2t+2$. We prove that ISLPs break, for some text families, the measure $\delta$ based on substring complexity, a lower bound for most measures and compressors exploiting repetitiveness. Further, ISLPs can extract any substring of length $\lambda$, from the represented text $T[1..n]$, in time $O(\lambda + \log^2 n \log\log n)$. This is the first compressed representation for repetitive texts breaking $\delta$ while, at the same time, supporting direct access to arbitrary text symbols in polylogarithmic time. We also show how to compute some substring queries, like range minima and next/previous smaller value, in time $O(\log^2 n \log\log n)$. Finally, we further specialize the grammars to Run-Length SLPs (RLSLPs), which restrict the rules allowed by ISLPs to the form $A \rightarrow B^t$. Apart from inheriting all the previous results with the term $\log^2 n \log\log n$ reduced to the near-optimal $\log n$, we show that RLSLPs can exploit balance to efficiently compute a wide class of substring queries we call ``composable'' -- i.e., $f(X \cdot Y)$ can be obtained from $f(X)$ and $f(Y)$...
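As a concrete reading of the ISLP rule form in the abstract, the following Python sketch expands a single rule given the already-expanded strings of $B_1, \ldots, B_t$. The function names `expand_rule` and `rule_length` and the encoding of a rule as `(k1, k2, factors)` are assumptions made for this example, and the sketch assumes $k_1 \le k_2$. It works on uncompressed strings, so it shows only what a rule generates, not the paper's access structures.

```python
def expand_rule(k1, k2, factors):
    """Expand A -> prod_{i=k1}^{k2} B_1^{i^{c_1}} ... B_t^{i^{c_t}}.

    factors is a list of (text_j, c_j) pairs, where text_j is the
    expansion of B_j and c_j is its exponent degree (hypothetical encoding).
    Assumes k1 <= k2.
    """
    pieces = []
    for i in range(k1, k2 + 1):
        for text, c in factors:
            # In iteration i, B_j is repeated i**c_j times.
            pieces.append(text * (i ** c))
    return "".join(pieces)


def rule_length(k1, k2, factors):
    """Length of the expansion, computed without building the string.

    This naive version still loops over every i in [k1, k2]; it is a
    sketch, not the method used in the paper.
    """
    return sum(len(text) * (i ** c)
               for i in range(k1, k2 + 1)
               for text, c in factors)


if __name__ == "__main__":
    # A -> prod_{i=1}^{3} a^{i} b^{1}, i.e. c_1 = 1 and c_2 = 0.
    print(expand_rule(1, 3, [("a", 1), ("b", 0)]))   # abaabaaab
    print(rule_length(1, 3, [("a", 1), ("b", 0)]))   # 9
```

Under this encoding, the rule is described by $k_1$, $k_2$, and $t$ pairs $(B_j, c_j)$, which matches the stated size $2t+2$. A run-length rule $A \rightarrow B^t$ is the special case with a single factor, for example $k_1 = k_2 = t$ and $c_1 = 1$.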
