Logarithmic-Time Internal Pattern Matching Queries in Compressed and Dynamic Texts
Anouk Duyster, Tomasz Kociumaka
TL;DR
The paper tackles efficient IPM queries on compressed and dynamic texts by leveraging restricted recompression-based RLSLPs and a novel popped-sequence framework. It constructs a proxy pattern and a proxy text to reduce IPM to a small number of run-length pattern-matching and LCE checks, all within $O(r)$ time per query for an $r$-round representation. When instantiated on optimal RLSLPs of size $O\bigl(\delta \log \frac{n \log \sigma}{\delta \log n}\bigr)$, where $\delta$ denotes the substring complexity, the approach yields IPM query time $O(\log n)$ within that same space bound; preprocessing from LZ77 can be achieved in $O\bigl(\delta \log^7 n\bigr)$ time. In the dynamic setting, the method integrates with fully persistent updates that take $O(\log N)$ time with high probability, enabling a PILLAR-model implementation with $O(\log N)$ time per operation. These contributions speed up IPM-based fragment queries in compressed and dynamic texts, supporting fast approximate pattern matching and related tasks on compressed representations.
Abstract
Internal Pattern Matching (IPM) queries on a text $T$, given two fragments $X$ and $Y$ of $T$ such that $|Y|<2|X|$, ask to compute all exact occurrences of $X$ within $Y$. IPM queries have been introduced by Kociumaka, Radoszewski, Rytter, and Waleń [SODA'15 & SICOMP'24], who showed that they can be answered in $O(1)$ time using a data structure of size $O(n)$ and used this result to answer various queries about fragments of $T$. In this work, we study IPM queries on compressed and dynamic strings. Our result is an $O(\log n)$-time query algorithm applicable to any balanced recompression-based run-length straight-line program (RLSLP). In particular, one can use it on top of the RLSLP of Kociumaka, Navarro, and Prezza [IEEE TIT'23], whose size $O\big(δ\log \frac{n\log σ}{δ\log n}\big)$ is optimal (among all text representations) as a function of the text length $n$, the alphabet size $σ$, and the substring complexity $δ$. Our procedure does not rely on any preprocessing of the underlying RLSLP, which makes it readily applicable on top of the dynamic strings data structure of Gawrychowski, Karczmarz, Kociumaka, Łącki and Sankowski [SODA'18], which supports fully persistent updates in logarithmic time with high probability.
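To make the query definition concrete, the following is a minimal brute-force sketch of what an IPM query returns on an uncompressed Python string. It is only a reference for the semantics, not the paper's algorithm: it runs in $O(|X|\cdot|Y|)$ time and works on plain text rather than an RLSLP. The function name `ipm_naive` and the half-open index convention are assumptions made for illustration.

```python
def ipm_naive(text, x_start, x_end, y_start, y_end):
    """Brute-force IPM query (illustrative only).

    X = text[x_start:x_end] and Y = text[y_start:y_end] are fragments of
    `text`, given as half-open 0-based ranges (an assumed convention).
    Returns the starting positions in `text` of all occurrences of X
    that lie entirely inside Y.
    """
    n = len(text)
    if not (0 <= x_start < x_end <= n and 0 <= y_start <= y_end <= n):
        raise ValueError("fragments must be valid ranges of the text")
    m = x_end - x_start
    if y_end - y_start >= 2 * m:
        # The query is only defined when |Y| < 2|X|.
        raise ValueError("IPM requires |Y| < 2|X|")
    x = text[x_start:x_end]
    # Try every alignment of X inside Y and compare character by character.
    return [i for i in range(y_start, y_end - m + 1) if text[i:i + m] == x]


if __name__ == "__main__":
    t = "abababab"
    # X = "abab", Y = "abababa" (|Y| = 7 < 8 = 2|X|)
    print(ipm_naive(t, 0, 4, 0, 7))  # prints [0, 2]
```

The length restriction $|Y|<2|X|$ is what keeps the output compact: under it, the occurrences are known to form a single arithmetic progression, so an answer can be reported by its first position, its difference, and its count instead of an explicit list.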
