Counting on General Run-Length Grammars
Gonzalo Navarro, Alejandro Pacheco
TL;DR
This work addresses counting pattern occurrences in texts represented by arbitrary run-length context-free grammars (RLCFGs), a problem left open by prior CFG/RLCFG indexing work. It introduces a counting index that uses $O(g_{rl})$ space, where $g_{rl}$ is the size of the RLCFG, and answers queries in $O(m\log^{2+\epsilon} n)$ time for any fixed $\epsilon>0$, matching the best CFG-based results while supporting general RLCFGs. The authors classify run-length rules into two types and develop two complementary data structures: an enhanced grid approach for type-E rules and period-based structures for type-L rules. Together these count all occurrences exactly, both primary and secondary. The index is built in $O(n\log n)$ expected time. The authors also show an application to computing maximal exact matches (MEMs) and $k$-MEMs, extending compressed-text processing to arbitrary RLCFGs.
Abstract
We introduce a data structure for counting pattern occurrences in texts compressed with any run-length context-free grammar. Our structure uses space proportional to the grammar size and counts the occurrences of a pattern of length $m$ in a text of length $n$ in time $O(m\log^{2+\epsilon} n)$, for any constant $\epsilon > 0$ chosen at indexing time. This is the first solution to an open problem posed by Christiansen et al. [ACM TALG 2020] and enhances our abilities for computation over compressed data; we give an example application.
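To make the counting problem concrete, the following is a minimal sketch of a toy RLCFG and a naive counter that decompresses the text before scanning it. It is not the index described above: it takes time proportional to the text length $n$ rather than $O(m\log^{2+\epsilon} n)$, and it serves only as a reference for what "number of occurrences" means. The class names (`Terminal`, `Concat`, `Run`), the helper functions, and the grammar `TOY_RULES` are all hypothetical and chosen for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Union


@dataclass(frozen=True)
class Terminal:
    char: str          # rule A -> c, a single character


@dataclass(frozen=True)
class Concat:
    left: str          # rule A -> B C
    right: str


@dataclass(frozen=True)
class Run:
    symbol: str        # run-length rule A -> B^t
    times: int


Rule = Union[Terminal, Concat, Run]


def expansion_length(rules: Dict[str, Rule], symbol: str,
                     memo: Optional[Dict[str, int]] = None) -> int:
    """Length of the string generated by `symbol`, computed without expanding it."""
    if memo is None:
        memo = {}
    if symbol not in memo:
        rule = rules[symbol]
        if isinstance(rule, Terminal):
            memo[symbol] = 1
        elif isinstance(rule, Concat):
            memo[symbol] = (expansion_length(rules, rule.left, memo)
                            + expansion_length(rules, rule.right, memo))
        else:  # Run: t copies of the same expansion
            memo[symbol] = rule.times * expansion_length(rules, rule.symbol, memo)
    return memo[symbol]


def expand(rules: Dict[str, Rule], symbol: str) -> str:
    """Fully decompress `symbol` (linear in the output length; illustration only)."""
    rule = rules[symbol]
    if isinstance(rule, Terminal):
        return rule.char
    if isinstance(rule, Concat):
        return expand(rules, rule.left) + expand(rules, rule.right)
    return expand(rules, rule.symbol) * rule.times


def count_occurrences(text: str, pattern: str) -> int:
    """Count all occurrences of `pattern` in `text`, including overlapping ones."""
    if not pattern:
        raise ValueError("pattern must be non-empty")
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


# Hypothetical toy grammar generating "ababababa":
#   X -> a b,  Y -> X^4,  S -> Y a
TOY_RULES: Dict[str, Rule] = {
    "a": Terminal("a"),
    "b": Terminal("b"),
    "X": Concat("a", "b"),
    "Y": Run("X", 4),
    "S": Concat("Y", "a"),
}
```

With this grammar, `expand(TOY_RULES, "S")` yields `"ababababa"` and `count_occurrences(expand(TOY_RULES, "S"), "aba")` returns 4, since the occurrences overlap. The run-length rule `Y -> X^4` is stored as one rule regardless of the exponent, which is why an RLCFG can be much smaller than the text it generates.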
