Logarithmic-Time Internal Pattern Matching Queries in Compressed and Dynamic Texts

Anouk Duyster, Tomasz Kociumaka

TL;DR

The paper answers IPM queries on compressed and dynamic texts using restricted recompression-based RLSLPs and a popped-sequence framework. It constructs a proxy pattern and a proxy text to reduce an IPM query to a small number of run-length pattern-matching and LCE checks, for a total of $O(r)$ time per query on an $r$-round representation. When instantiated on the optimal RLSLP of size $O\bigl(\delta \log \frac{n \log \sigma}{\delta \log n}\bigr)$, where $\delta$ denotes the substring complexity, the approach yields $O(\log n)$ query time within that same space; preprocessing from LZ77 can be achieved in $O\bigl(\delta \log^7 n\bigr)$ time. In the dynamic setting, the method integrates with fully persistent updates in $O(\log N)$ time w.h.p., enabling a PILLAR-model implementation with $O(\log N)$ time per operation. As a result, IPM-based fragment queries, and algorithms built on them such as approximate pattern matching, run efficiently on compressed and dynamic texts.

Abstract

Internal Pattern Matching (IPM) queries on a text $T$, given two fragments $X$ and $Y$ of $T$ such that $|Y|<2|X|$, ask to compute all exact occurrences of $X$ within $Y$. IPM queries have been introduced by Kociumaka, Radoszewski, Rytter, and Waleń [SODA'15 & SICOMP'24], who showed that they can be answered in $O(1)$ time using a data structure of size $O(n)$ and used this result to answer various queries about fragments of $T$. In this work, we study IPM queries on compressed and dynamic strings. Our result is an $O(\log n)$-time query algorithm applicable to any balanced recompression-based run-length straight-line program (RLSLP). In particular, one can use it on top of the RLSLP of Kociumaka, Navarro, and Prezza [IEEE TIT'23], whose size $O\big(δ\log \frac{n\log σ}{δ\log n}\big)$ is optimal (among all text representations) as a function of the text length $n$, the alphabet size $σ$, and the substring complexity $δ$. Our procedure does not rely on any preprocessing of the underlying RLSLP, which makes it readily applicable on top of the dynamic strings data structure of Gawrychowski, Karczmarz, Kociumaka, Łącki and Sankowski [SODA'18], which supports fully persistent updates in logarithmic time with high probability.
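To make the query definition concrete, the following is a naive reference sketch of an IPM query on an uncompressed Python string. It is not the paper's algorithm: it scans $Y$ character by character and takes time proportional to $|X|\cdot|Y|$. The function names `ipm_naive` and `as_progression` and the half-open `(start, end)` fragment encoding are assumptions made for this illustration. The second helper relies on the standard fact that, when $|Y|<2|X|$, the occurrences of $X$ in $Y$ form an arithmetic progression, so the answer fits in constant space.

```python
from __future__ import annotations


def ipm_naive(text: str, x: tuple[int, int], y: tuple[int, int]) -> list[int]:
    """Return the start positions (in text coordinates) of occurrences of X inside Y.

    X and Y are fragments of `text`, given as half-open index pairs (start, end).
    This is a brute-force scan, used only to illustrate what an IPM query returns.
    """
    xs, xe = x
    ys, ye = y
    if not (0 <= xs < xe <= len(text) and 0 <= ys <= ye <= len(text)):
        raise ValueError("fragments must lie inside the text and X must be non-empty")
    m = xe - xs
    if ye - ys >= 2 * m:
        raise ValueError("IPM queries require |Y| < 2|X|")
    pattern = text[xs:xe]
    # Try every start position i such that text[i:i+m] lies fully inside Y.
    return [i for i in range(ys, ye - m + 1) if text[i:i + m] == pattern]


def as_progression(positions: list[int]) -> tuple[int, int, int] | None:
    """Compress sorted positions into (first, difference, count), or None if empty."""
    if not positions:
        return None
    if len(positions) == 1:
        return (positions[0], 0, 1)
    d = positions[1] - positions[0]
    # For |Y| < 2|X| the occurrences are equally spaced; check that here.
    assert all(b - a == d for a, b in zip(positions, positions[1:]))
    return (positions[0], d, len(positions))
```

For example, with `text = "abababab"`, `X = text[0:4]` and `Y = text[1:8]`, `ipm_naive(text, (0, 4), (1, 8))` returns `[2, 4]`, which `as_progression` compresses to `(2, 2, 2)`.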

Paper Structure

This paper contains 9 sections, 13 theorems, 9 equations, 3 figures.

Key Result

theorem 1

IPM queries on a text $T$ represented using a (restricted) $r$-round recompression run-length straight-line program can be answered in $\mathcal{O}(r)$ time.
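As background for the theorem, the sketch below models a generic run-length straight-line program with its three standard rule types: a terminal, a concatenation of two symbols, and a run-length power of one symbol. It does not implement restricted recompression or the paper's query procedure. It only illustrates why work on such a grammar is naturally measured in grammar levels: random access descends one level per step. The class names `Terminal`, `Pair`, `Power` and the helpers `char_at` and `expand` are hypothetical.

```python
class Terminal:
    """Rule A -> a for a single character a."""
    def __init__(self, char: str) -> None:
        self.char = char
        self.length = 1


class Pair:
    """Rule A -> B C (concatenation of two symbols)."""
    def __init__(self, left, right) -> None:
        self.left, self.right = left, right
        self.length = left.length + right.length


class Power:
    """Rule A -> B^k (k consecutive copies of one symbol)."""
    def __init__(self, base, exponent: int) -> None:
        self.base, self.exponent = base, exponent
        self.length = base.length * exponent


def char_at(symbol, i: int) -> str:
    """Return character i of the expansion, descending one grammar level per step."""
    if not 0 <= i < symbol.length:
        raise IndexError(i)
    while not isinstance(symbol, Terminal):
        if isinstance(symbol, Pair):
            if i < symbol.left.length:
                symbol = symbol.left
            else:
                i -= symbol.left.length
                symbol = symbol.right
        else:  # Power: all copies are identical, so reduce the index modulo |base|
            i %= symbol.base.length
            symbol = symbol.base
    return symbol.char


def expand(symbol) -> str:
    """Fully decompress a symbol (for checking small examples only)."""
    if isinstance(symbol, Terminal):
        return symbol.char
    if isinstance(symbol, Pair):
        return expand(symbol.left) + expand(symbol.right)
    return expand(symbol.base) * symbol.exponent


# Example grammar for "abababa": root -> (ab)^3 a
a, b = Terminal("a"), Terminal("b")
root = Pair(Power(Pair(a, b), 3), a)
```

Here `expand(root)` gives `"abababa"` and `char_at(root, 5)` gives `"b"` without decompressing the text.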

Figures (3)

  • Figure 1: The popped sequence is built from the blocks $L_0$ through $R_0$. In every level $k$, the string $\bar{X}_k$ spans from $L_k$ to $R_k$ (inclusive).
  • Figure 2: The queries that compute $\overline{u}$ and $u$ check how far the period $g$ of $\exp(\bar{X}_\ell)=X\bm{[}\,\overline{c}\,\bm{.\,.}\,c\,\bm{)}$ extends within $X$.
  • Figure 3: The thick black lines are the detected occurrences of $\exp(\bar{X}_\ell)$ in $Y$; these occurrences start $g$ positions apart and their union is $Y\bm{[}\,\overline{a}\,\bm{.\,.}\,a\,\bm{)}$. The queries that compute $\overline{v}$ and $v$ check how far the period $g$ of $Y\bm{[}\,\overline{a}\,\bm{.\,.}\,a\,\bm{)}$ extends within $Y$.
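The period-extension step described in Figures 2 and 3 can be illustrated on a plain string. The sketch below takes a fragment with a known period $g$ and finds the maximal surrounding fragment on which that period still holds. It compares characters one at a time; this is a stand-in, chosen for readability, for the constant number of LCE-style queries that a compressed implementation would use. The function name `extend_period` and the requirement that the fragment has length at least $g$ are assumptions of this sketch.

```python
def extend_period(s: str, start: int, end: int, g: int) -> tuple:
    """Return the maximal (lo, hi) with lo <= start < end <= hi such that
    s[lo:hi] has period g, given that s[start:end] already has period g."""
    if g <= 0 or not (0 <= start < end <= len(s)) or end - start < g:
        raise ValueError("need a fragment of length at least g inside s")
    if any(s[i] != s[i + g] for i in range(start, end - g)):
        raise ValueError("g is not a period of s[start:end]")
    lo, hi = start, end
    # Extend left while the new character matches the one g positions to its right.
    while lo > 0 and s[lo - 1] == s[lo - 1 + g]:
        lo -= 1
    # Extend right while the new character matches the one g positions to its left.
    while hi < len(s) and s[hi] == s[hi - g]:
        hi += 1
    return (lo, hi)
```

For `s = "xabcabcabcay"`, the fragment `s[4:7] = "abc"` with `g = 3` extends to `(1, 11)`, that is, to `"abcabcabca"`.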

Theorems & Definitions (31)

  • theorem 1
  • corollary 1
  • corollary 2
  • definition 1: Restricted run-length encoding [KRRW23, KNP23]
  • remark 1
  • definition 2: Restricted pair compression [KRRW23, KNP23]
  • remark 2
  • definition 3: Restricted recompression [KRRW23, KNP23]
  • proof
  • proof
  • ...and 21 more