Text Indexing and Pattern Matching with Ephemeral Edits

Solon P. Pissis

TL;DR

The paper studies pattern matching and text indexing when the text undergoes ephemeral edits: each edit is transient and is reverted before the next one is applied. For text indexing, the text is preprocessed in $\mathcal{O}(n)$ time; a pattern of length $m$ is then preprocessed in $\mathcal{O}(m\log\log m)$ time and $\mathcal{O}(m)$ space, after which all Occ occurrences of the pattern in the edited text are reported in $\mathcal{O}(\log\log n+\text{Occ})$ time per ephemeral edit. For pattern matching, where $T$ and $P$ are both given in advance, preprocessing takes $\mathcal{O}(n)$ time and reporting takes the optimal $\mathcal{O}(\text{Occ})$ time per edit; this includes an optimal solution for ephemeral block deletions. The results combine suffix-tree-based preprocessing, prefix-suffix queries, and efficient predecessor data structures, yielding simple-to-implement, optimal or near-optimal solutions suited to scenarios such as testing hypothetical edits or pangenomic variant analysis with transient queries.

Abstract

A sequence $e_0,e_1,\ldots$ of edit operations in a string $T$ is called ephemeral if operation $e_i$ constructing string $T^i$, for all $i=2k$ with $k\in\mathbb{N}$, is reverted by operation $e_{i+1}$ that reconstructs $T$. Such a sequence arises when processing a stream of independent edits or testing hypothetical edits. We introduce text indexing with ephemeral substring edits, a new version of text indexing. Our goal is to design a data structure over a given text that supports subsequent pattern matching queries with ephemeral substring insertions, deletions, or substitutions in the text; we require insertions and substitutions to be of constant length. In particular, we preprocess a text $T=T[0\mathinner{.\,.} n)$ over an integer alphabet $\Sigma=[0,\sigma)$ with $\sigma=n^{\mathcal{O}(1)}$ in $\mathcal{O}(n)$ time. Then, we can preprocess any arbitrary pattern $P=P[0\mathinner{.\,.} m)$ given online in $\mathcal{O}(m\log\log m)$ time and $\mathcal{O}(m)$ space and allow any ephemeral sequence of edit operations in $T$. Before reverting the $i$th operation, we report all Occ occurrences of $P$ in $T^i$ in $\mathcal{O}(\log\log n + \text{Occ})$ time. We also introduce pattern matching with ephemeral edits. In particular, we preprocess two strings $T$ and $P$, each of length at most $n$, over an integer alphabet $\Sigma=[0,\sigma)$ with $\sigma=n^{\mathcal{O}(1)}$ in $\mathcal{O}(n)$ time. Then, we allow any ephemeral sequence of edit operations in $T$. Before reverting the $i$th operation, we report all Occ occurrences of $P$ in $T^i$ in the optimal $\mathcal{O}(\text{Occ})$ time. Along our way to this result, we also give an optimal solution for pattern matching with ephemeral block deletions.
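
To make the query model concrete, the following is a minimal sketch of a naive baseline, not the paper's data structure: it rebuilds the edited string and rescans it for every edit, so each query costs time linear in the text rather than $\mathcal{O}(\log\log n+\text{Occ})$ or $\mathcal{O}(\text{Occ})$. The names `Edit`, `apply_edit`, `occurrences`, and `ephemeral_queries` are hypothetical, and encoding an edit as a triple `(pos, delete_len, inserted)` is an assumption made for illustration. The text and pattern are the ones used in the figures.

```python
from typing import Iterable, Iterator, List, Tuple

# Hypothetical edit encoding (an assumption, not the paper's notation):
# (pos, delete_len, inserted) replaces T[pos : pos + delete_len) by `inserted`.
# Insertion: delete_len == 0. Deletion: inserted == "". Substitution: both non-empty.
Edit = Tuple[int, int, str]


def apply_edit(text: str, edit: Edit) -> str:
    """Return the edited string T^i; the input string is left untouched."""
    pos, delete_len, inserted = edit
    if pos < 0 or delete_len < 0 or pos + delete_len > len(text):
        raise ValueError("edit out of range")
    return text[:pos] + inserted + text[pos + delete_len:]


def occurrences(text: str, pattern: str) -> List[int]:
    """All starting positions of pattern in text, overlapping ones included."""
    hits = []
    start = text.find(pattern)
    while start != -1:
        hits.append(start)
        start = text.find(pattern, start + 1)
    return hits


def ephemeral_queries(text: str, pattern: str,
                      edits: Iterable[Edit]) -> Iterator[List[int]]:
    """For each edit, report the occurrences of pattern in the edited text.

    Every edit is applied to the original text, which models the revert
    step: no edit is visible to the next one.
    """
    for edit in edits:
        edited = apply_edit(text, edit)     # e_i constructs T^i
        yield occurrences(edited, pattern)  # report before reverting
        # e_{i+1} reverts: `text` was never modified, so nothing to undo


T = "ananabannabanaana"
P = "banana"
edits = [
    (14, 0, "n"),   # insert "n" before T[14]
    (8, 3, "ana"),  # substitute T[8..11) by "ana"
    (5, 1, ""),     # delete T[5]
]
for edit, occ in zip(edits, ephemeral_queries(T, P, edits)):
    print(edit, occ)
```

Because Python strings are immutable, the revert step is implicit here; a solution along the lines of the abstract would instead keep the preprocessed structures for $T$ and $P$ fixed and answer each query without rebuilding the text.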

Paper Structure

This paper contains 26 sections, 4 theorems, and 8 figures.

Key Result

Theorem 1

Given a text $T$ of length $n$ over an integer alphabet $\Sigma=[0,\sigma)$ with $\sigma=n^{\mathcal{O}(1)}$, we can preprocess it in $\mathcal{O}(n)$ time and space to support the following:

Figures (8)

  • Figure 1: The suffix tree ST of $T=T[0\mathinner{.\,.} 17)=\texttt{ananabannabanaana}$ with suffix links (dotted). The labels of edges leading to leaf nodes are truncated after the first letter to avoid cluttering the figure. The suffix array SA of $T$ is $[16,13,9,4,14,11,2,0,6,10,5,15,12,8,3,1,7]$ and is inferred from ST using an in-order bottom-up traversal. The node labeled 16, which represents suffix $T[16\mathinner{.\,.} 17)$, also stores the SA interval $[0,8]$. The node spelling string $\texttt{ban}$ from the root stores the SA interval $[9,10]$.
  • Figure 2: $\textsf{ST}(P)$ and $\textsf{TREE}(P)$ for $P=P[0\mathinner{.\,.} 11)=\texttt{mississippi}$. The labels of edges leading to leaf nodes in $\textsf{ST}(P)$ are truncated after the first letter to avoid cluttering the figure. The labels are omitted in $\textsf{TREE}(P)$.
  • Figure 3: $\textsf{TREE}(P)$ for $P=\texttt{banana}$ with the intervals from the SA of $T=\texttt{ananabannabanaana}$. Note that the suffix $\texttt{banana}$ of $P$ has no interval stored because it does not occur in $T$.
  • Figure 4: $\textsf{TREE}_c(P)$ obtained from $\textsf{TREE}(P)$ for $P=\texttt{banana}$ and $c\in\{\texttt{a},\texttt{b},\texttt{n}\}$ with the intervals from the SA of $T=\texttt{ananabannabanaana}$ made disjoint.
  • Figure 5: $\textsf{TREE}_S(P)$ obtained from $\textsf{TREE}(P)$ for $P=\texttt{banana}$ and $S\in\{\texttt{an},\texttt{ba},\texttt{na}\}$ with the intervals from the SA of $T=\texttt{ananabannabanaana}$ made disjoint.
  • ...and 3 more figures
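
The suffix array and SA intervals quoted in the captions of Figures 1 and 3 can be checked with a short sketch. It sorts suffixes directly, which is much slower than the linear-time suffix-tree construction the paper relies on, and the function names `suffix_array` and `sa_interval` are hypothetical. Intervals are inclusive, matching the captions.

```python
from typing import List, Optional, Tuple


def suffix_array(text: str) -> List[int]:
    """Starting positions of all suffixes in lexicographic order (naive sort)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def sa_interval(text: str, sa: List[int],
                pattern: str) -> Optional[Tuple[int, int]]:
    """Inclusive range of SA ranks whose suffixes start with pattern.

    Returns None when pattern does not occur in text. Two binary searches
    compare pattern with the length-m prefix of each suffix.
    """
    m = len(pattern)

    # First rank whose truncated suffix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    left = lo

    # First rank whose truncated suffix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return None if left == lo else (left, lo - 1)


T = "ananabannabanaana"
SA = suffix_array(T)
print(SA)
print(sa_interval(T, SA, "a"))       # interval stored at the node labeled 16
print(sa_interval(T, SA, "ban"))     # interval stored at the node spelling "ban"
print(sa_interval(T, SA, "banana"))  # no interval: "banana" does not occur in T
```

Once an interval is known, the occurrences of the string are the SA entries inside it, which is why reporting can cost time proportional to Occ after the interval has been located.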

Theorems & Definitions (11)

  • Theorem 1
  • Theorem 2
  • Theorem 3
  • Theorem 4: DBLP:conf/sosa/Pissis25
  • Example 1
  • Example 2
  • Example 3: Deletion
  • Example 4: Insertion
  • Example 5: Insertion of Substring
  • Example 6
  • ...and 1 more