Logarithmic-Time Internal Pattern Matching Queries in Compressed and Dynamic Texts

Anouk Duyster; Tomasz Kociumaka

TL;DR

The paper tackles efficient IPM queries on compressed and dynamic texts by leveraging restricted recompression-based RLSLPs and a novel popped-sequence framework. It constructs a proxy pattern and a proxy text to reduce IPM to a small number of run-length pattern-matching and LCE checks, all within $O(r)$ time per query for an $r$-round representation. When instantiated on optimal RLSLPs of size $O\bigl(\delta\log\frac{n\log\sigma}{\delta\log n}\bigr)$, where $\delta$ denotes the substring complexity, the approach yields IPM query time $O(\log n)$ in that same space; preprocessing from LZ77 can be achieved in $O\bigl(\delta\log^7 n\bigr)$ time. In the dynamic setting, the method integrates with fully persistent updates in $O(\log N)$ time w.h.p., enabling a PILLAR-model implementation with $O(\log N)$ time per operation. These contributions bring logarithmic-time IPM queries to compressed and dynamic texts, supporting fast approximate pattern matching and related tasks on compressed representations.
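To make the RLSLP and LCE terminology above concrete, here is a minimal sketch of a run-length straight-line program with random access and a character-by-character LCE check. It is an illustration only, not the paper's algorithm: the class names (`Terminal`, `Pair`, `Run`), the dictionary-based grammar, and the helpers `lengths`, `access`, and `lce_naive` are all hypothetical, the grammar is not balanced or recompression-based, and the LCE routine is deliberately naive.

```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass(frozen=True)
class Terminal:
    char: str            # A -> c (a single character)


@dataclass(frozen=True)
class Pair:
    left: str            # A -> B C
    right: str


@dataclass(frozen=True)
class Run:
    child: str           # A -> B^count (run-length rule)
    count: int


Rule = Union[Terminal, Pair, Run]


def lengths(rules: Dict[str, Rule]) -> Dict[str, int]:
    """Compute the expansion length of every symbol (memoized recursion)."""
    memo: Dict[str, int] = {}

    def go(name: str) -> int:
        if name not in memo:
            rule = rules[name]
            if isinstance(rule, Terminal):
                memo[name] = 1
            elif isinstance(rule, Pair):
                memo[name] = go(rule.left) + go(rule.right)
            else:
                memo[name] = rule.count * go(rule.child)
        return memo[name]

    for name in rules:
        go(name)
    return memo


def access(rules: Dict[str, Rule], length: Dict[str, int], name: str, i: int) -> str:
    """Return character i (0-based) of the expansion of `name` by descending the grammar."""
    if not 0 <= i < length[name]:
        raise IndexError("position outside the expansion")
    while True:
        rule = rules[name]
        if isinstance(rule, Terminal):
            return rule.char
        if isinstance(rule, Pair):
            if i < length[rule.left]:
                name = rule.left
            else:
                i -= length[rule.left]
                name = rule.right
        else:
            # All copies in a run are identical, so reduce i modulo one copy.
            i %= length[rule.child]
            name = rule.child


def lce_naive(rules: Dict[str, Rule], length: Dict[str, int], root: str, i: int, j: int) -> int:
    """Longest common extension of the suffixes starting at i and j (naive scan)."""
    n = length[root]
    k = 0
    while i + k < n and j + k < n and \
            access(rules, length, root, i + k) == access(rules, length, root, j + k):
        k += 1
    return k


# Toy grammar whose start symbol S expands to (ab)^3 a.
rules: Dict[str, Rule] = {
    "a": Terminal("a"),
    "b": Terminal("b"),
    "AB": Pair("a", "b"),
    "R": Run("AB", 3),
    "S": Pair("R", "a"),
}
length = lengths(rules)
expanded = "".join(access(rules, length, "S", i) for i in range(length["S"]))
```

In this toy grammar the run rule `R` stores three copies of `ab` in a single rule, which is the feature that distinguishes an RLSLP from a plain SLP. Each `access` call costs time proportional to the grammar height, so a balanced grammar is what keeps such descents logarithmic.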

Abstract

Internal Pattern Matching (IPM) queries on a text $T$, given two fragments $X$ and $Y$ of $T$ such that $|Y|<2|X|$, ask to compute all exact occurrences of $X$ within $Y$. IPM queries have been introduced by Kociumaka, Radoszewski, Rytter, and Waleń [SODA'15 & SICOMP'24], who showed that they can be answered in $O(1)$ time using a data structure of size $O(n)$ and used this result to answer various queries about fragments of $T$. In this work, we study IPM queries on compressed and dynamic strings. Our result is an $O(\log n)$-time query algorithm applicable to any balanced recompression-based run-length straight-line program (RLSLP). In particular, one can use it on top of the RLSLP of Kociumaka, Navarro, and Prezza [IEEE TIT'23], whose size $O\big(δ\log \frac{n\log σ}{δ\log n}\big)$ is optimal (among all text representations) as a function of the text length $n$, the alphabet size $σ$, and the substring complexity $δ$. Our procedure does not rely on any preprocessing of the underlying RLSLP, which makes it readily applicable on top of the dynamic strings data structure of Gawrychowski, Karczmarz, Kociumaka, Łącki and Sankowski [SODA'18], which supports fully persistent updates in logarithmic time with high probability.
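The query defined above can be pinned down with a brute-force reference implementation. The sketch below only illustrates the problem statement on an uncompressed string; it is not the $O(\log n)$ algorithm of the paper. The function names and the half-open `(start, end)` fragment convention are assumptions made for this example.

```python
from typing import List, Tuple


def ipm_naive(text: str, x: Tuple[int, int], y: Tuple[int, int]) -> List[int]:
    """Return the starting positions (in `text`) of all occurrences of X within Y.

    Fragments are half-open index pairs: X = text[x[0]:x[1]], Y = text[y[0]:y[1]].
    """
    xs, xe = x
    ys, ye = y
    m = xe - xs
    if m <= 0:
        raise ValueError("X must be non-empty")
    if not (ye - ys < 2 * m):
        raise ValueError("IPM queries require |Y| < 2|X|")
    pattern = text[xs:xe]
    # Try every alignment of X that fits entirely inside Y.
    return [i for i in range(ys, ye - m + 1) if text[i:i + m] == pattern]


def is_arithmetic_progression(positions: List[int]) -> bool:
    """Check that consecutive positions all differ by the same amount."""
    diffs = {b - a for a, b in zip(positions, positions[1:])}
    return len(diffs) <= 1


text = "abababab"
# X = text[0:4] = "abab", Y = text[1:8] = "bababab", so |Y| = 7 < 8 = 2|X|.
occurrences = ipm_naive(text, (0, 4), (1, 8))
```

Because $|Y| < 2|X|$, any two occurrences of $X$ in $Y$ overlap, and by a standard periodicity argument the starting positions form an arithmetic progression. This is why the answer to an IPM query admits a constant-size description (first position, difference, count), even though the naive scan above lists every occurrence explicitly.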
