Counting on General Run-Length Grammars

Gonzalo Navarro, Alejandro Pacheco

TL;DR

This work tackles counting pattern occurrences in texts represented by arbitrary run-length context-free grammars (RLCFGs), a problem left open by prior CFG/RLCFG indexing work. It introduces a counting index that uses $O(g_{rl})$ space, where $g_{rl}$ is the size of the grammar, and answers queries in $O(m\log^{2+\epsilon} n)$ time for any fixed $\epsilon>0$, matching the best CFG-based results while handling general RLCFGs. The authors classify run-length rules into two types and develop two complementary data structures: an enhanced grid approach for type-E rules and period-based structures for type-L rules, which together count all primary and secondary occurrences exactly. The index is built in $O(n\log n)$ expected time. The authors also show an application to computing maximal exact matches (MEMs) and $k$-MEMs, extending compressed-text processing to arbitrary RLCFGs.

Abstract

We introduce a data structure for counting pattern occurrences in texts compressed with any run-length context-free grammar. Our structure uses space proportional to the grammar size and counts the occurrences of a pattern of length $m$ in a text of length $n$ in time $O(m\log^{2+\epsilon} n)$, for any constant $\epsilon > 0$ chosen at indexing time. This is the first solution to an open problem posed by Christiansen et al. [ACM TALG 2020] and enhances our abilities for computation over compressed data; we give an example application.

Paper Structure

This paper contains 18 sections, 9 theorems, 5 equations, 5 figures.

Key Result

Lemma 2

If $p$ and $p'$ are periods of $S$ and $|S| \ge p+p'-\gcd(p,p')$, then $\gcd(p,p')$ is a period of $S$. Thus, the shortest period $p(S)$ divides all other periods $p \le |S|/2$ of $S$.
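
The lemma can be checked by brute force on small strings. The following is a minimal sketch, not part of the paper: the helper names `periods` and `check_periodicity_lemma` are hypothetical, and a period is assumed to be any $p$ with $1 \le p \le |S|$ such that $S[i] = S[i+p]$ for every valid $i$ (so $|S|$ is always a period).

```python
from itertools import product
from math import gcd


def periods(s):
    """Return every p in 1..len(s) with s[i] == s[i + p] for all valid i."""
    n = len(s)
    return [p for p in range(1, n + 1)
            if all(s[i] == s[i + p] for i in range(n - p))]


def check_periodicity_lemma(s):
    """Check both claims of the lemma on one non-empty string."""
    n = len(s)
    ps = set(periods(s))
    # Claim 1: if |S| >= p + p' - gcd(p, p'), then gcd(p, p') is a period.
    for p in ps:
        for q in ps:
            if n >= p + q - gcd(p, q) and gcd(p, q) not in ps:
                return False
    # Claim 2: the shortest period divides every period p <= |S| / 2.
    shortest = min(ps)
    return all(p % shortest == 0 for p in ps if 2 * p <= n)


# Exhaustive check over all binary strings of length 1..8.
all_ok = all(check_periodicity_lemma("".join(bits))
             for n in range(1, 9)
             for bits in product("ab", repeat=n))
print(periods("acgtacgtac"), all_ok)
```

For example, `periods("acgtacgtac")` lists 4, 8 and 10; the shortest period 4 is the only one not exceeding half the length, so the second claim holds trivially for that string.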

Figures (5)

  • Figure 1: On the left, a grammar tree for $T=\mathtt{abracadabra}$ (with straight solid edges), so $\exp(X_4)=T$. Dashed edges were removed from the parse tree. The only primary occurrence of $P=\mathtt{abra}$ in $T$ is marked with dark gray on the bottom; the secondary ones are in light gray. On the right, the grid used for searching primary occurrences. Gray stripes indicate the search ranges corresponding to the partition $P = R \mid Q$, where $R = \mathtt{a}$ and $Q = \mathtt{bra}$. The value $4$ stored in the resulting cell is the preorder of the child $X_5$ of the locus node $X_2$ where $Q$ starts.
  • Figure 2: We show the occurrences captured by the point $(x_p, y_p'') = (\exp(\hat{B}), \exp(\hat{B})^2)$. Note how the occurrence in the first row is correctly captured by $(x_p, y_p'')$, whereas that in the second row is not captured by any point. Consequently, the first row is effectively counted twice. Given that the point $(x_p, y_p'')$ is assigned a weight of $2 \cdot (s-1) \cdot c(A)$, the total number of occurrences is $4 \cdot c(A)$.
  • Figure 3: If $2|\hat{B}|<|Q|\le|B|$, there are $\lceil|Q|/p\rceil$ primary occurrences around the boundary between any two blocks $B$ (we zoom on one) with the cut $P = R \mid Q$. We show the possible alignments of $P$ below the blocks $\hat{B}$. For a rule $A \rightarrow B^s$ there are $(s-1)$ boundaries, yielding $(s-1) \cdot \lceil|Q|/p\rceil$ primary occurrences. In this case, $\lceil|Q|/p\rceil = 3$ and $s - 1 = 3$, yielding $9$ primary occurrences.
  • Figure 4: If $|Q| > |B|$, we can compute all occurrences of $P$ around blocks $\hat{B}$ without the risk of any occurrence being fully contained in a block $B$: the number of primary occurrences of $P$ in $\exp(A)$ is simply $s' - \lceil|Q|/p\rceil$. In this example, with $s' = 8$ and $\lceil|Q|/p\rceil = 3$, there are 5 occurrences.
  • Figure 5: On top, a RLCFG on the left and its grammar tree on the right. Type-E rules are enclosed in white rectangles and type-L rules in gray rectangles. Below the rules we show the values $C(B,s)$ and $C'(B,s)$ of Christiansen et al. [ACM TALG 2020] that we use to handle the type-E rules (see Section \ref{sec:prevcount}); we only show those for $\exp(X_1)=\mathtt{cgta}$. On the bottom left we show the points we add to the standard grid. The points for type-E rules are represented as $A^{(c(A))}$ and $A^{((s-2) \cdot c(A))}$ and those for type-L rules as $A^{(-(s-1) \cdot c(A))}$ and $A^{(2 \cdot (s-1) \cdot c(A))}$. The bottom right shows the grid $G_{\pi}$ and the array $F_{\pi}$ for the transformed rules $A \rightarrow \hat{B}^{s'}$ where $\hat{B} = \pi = \texttt{cgta}$. In $F_\pi$ we show the fields $F[i].sum$. In $G_\pi$, the row labels show $B^{(|B|)}$ and the column labels show $s'$; the points show $A^{(C', C'')}$. Consider the cut $P=\texttt{a} \mid \texttt{cgtacgtac}$, with $p(P)=4$. We identify $9$ occurrences in type-E rules: $4$ are found within the rule $X_9$ using the standard grid, while the remaining $5$ are determined via the values of $C(X_1, s)$ and $C'(X_1, s)$. These $5$ occurrences specifically arise within $\exp(X_2) = (\texttt{cgta})^4$. Similarly, in the type-L rules, we detect $14$ occurrences: $9$ occur within the rule $X_{11}$, identified using the $F_{\texttt{cgta}}$ array, and the remaining $5$ arise within $\exp(X_7) = (\texttt{cgta})^8$, captured using the $G_{\texttt{cgta}}$ grid. The final two occurrences of this cut are located using standard CFG rules at $\exp(S)[4\mathinner{.\,.} 13]$ ($X_1 \cdot X_2$) and $\exp(S)[111\mathinner{.\,.} 120]$ ($X_9 \cdot X_{11}$).
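
The count $s' - \lceil|Q|/p\rceil$ quoted in the caption of Figure 4 can be compared against a naive scan on the Figure 5 example, where $\exp(X_7) = (\texttt{cgta})^8$ and the cut is $P=\texttt{a} \mid \texttt{cgtacgtac}$. The sketch below is only a brute-force consistency check, not the paper's grid or array structures. It assumes $p = |\hat{B}| = 4$ (which agrees with $p(P)=4$ in the caption); `count_occurrences` and the variable names are hypothetical.

```python
from math import ceil


def count_occurrences(text, pattern):
    """Count all (possibly overlapping) occurrences of pattern in text."""
    m = len(pattern)
    return sum(1 for i in range(len(text) - m + 1) if text[i:i + m] == pattern)


block = "cgta"            # \hat{B} = pi from the Figure 5 caption
s_prime = 8               # exp(X_7) = (cgta)^8
R, Q = "a", "cgtacgtac"   # the cut P = R | Q from the caption
P = R + Q
p = len(block)            # assumption: p equals |\hat{B}|

text = block * s_prime
brute = count_occurrences(text, P)
formula = s_prime - ceil(len(Q) / p)

# Every occurrence found by the scan has R ending exactly at a block boundary,
# so all of them are aligned with this particular cut.
starts = [i for i in range(len(text) - len(P) + 1) if text[i:i + len(P)] == P]
aligned = all((i + len(R)) % p == 0 for i in starts)

print(brute, formula, aligned)  # 5 5 True
```

Both counts equal 5, matching the 5 occurrences the caption attributes to $\exp(X_7)$. Repeating the comparison for other exponents (for instance $s'$ from 3 to 11 with the same cut) gives the same agreement between the scan and the formula.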

Theorems & Definitions (13)

  • Definition 1
  • Lemma 2: periodicity
  • Lemma 3
  • Definition 4
  • Definition 5
  • Lemma 7
  • Definition 8
  • Lemma 9
  • Lemma 10
  • Lemma 11
  • ...and 3 more