On the Communication Complexity of Approximate Pattern Matching

Tomasz Kociumaka, Jakob Nogler, Philip Wellnitz

TL;DR

This work proves an upper bound of $O(n/m \cdot k \log^2 m)$ bits, thus establishing the optimal communication complexity up to logarithmic factors, and demonstrates a quantum algorithm that uses $O(n^{1+o(1)}/m \cdot \sqrt{km})$ queries and $O(n^{1+o(1)}/m \cdot (\sqrt{km} + k^{3.5}))$ quantum time.

Abstract

The decades-old Pattern Matching with Edits problem, given a length-$n$ string $T$ (the text), a length-$m$ string $P$ (the pattern), and a positive integer $k$ (the threshold), asks to list all fragments of $T$ that are at edit distance at most $k$ from $P$. The one-way communication complexity of this problem is the minimum amount of space needed to encode the answer so that it can be retrieved without accessing the input strings $P$ and $T$. The closely related Pattern Matching with Mismatches problem (defined in terms of the Hamming distance instead of the edit distance) is already well understood from the communication complexity perspective: Clifford, Kociumaka, and Porat [SODA 2019] proved that $Ω(n/m \cdot k \log(m/k))$ bits are necessary and $O(n/m \cdot k\log (m|Σ|/k))$ bits are sufficient; the upper bound allows encoding not only the occurrences of $P$ in $T$ with at most $k$ mismatches but also the substitutions needed to make each $k$-mismatch occurrence exact. Despite recent improvements in the running time [Charalampopoulos, Kociumaka, and Wellnitz; FOCS 2020 and 2022], the communication complexity of Pattern Matching with Edits remained unexplored, with a lower bound of $Ω(n/m \cdot k\log(m/k))$ bits and an upper bound of $O(n/m \cdot k^3\log m)$ bits stemming from previous research. In this work, we prove an upper bound of $O(n/m \cdot k \log^2 m)$ bits, thus establishing the optimal communication complexity up to logarithmic factors. We also show that $O(n/m \cdot k \log m \log (m|Σ|))$ bits allow encoding, for each $k$-error occurrence of $P$ in $T$, the shortest sequence of edits needed to make the occurrence exact. We leverage the techniques behind our new result on the communication complexity to obtain quantum algorithms for Pattern Matching with Edits.
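
To make the problem statement concrete, here is a minimal Python sketch of the classical dynamic-programming baseline (commonly attributed to Sellers). It is not the encoding scheme or the quantum algorithm of this paper. The function name `k_error_end_positions` and the decision to report only the end positions of matching fragments, rather than the fragments themselves, are assumptions made for illustration.

```python
def k_error_end_positions(pattern, text, k):
    """Return every j such that some fragment text[i:j] is within
    edit distance k of pattern (illustrative baseline, O(n*m) time)."""
    m = len(pattern)
    # prev[i] = minimum edit distance between pattern[:i] and any
    # fragment of text ending at the previous text position.
    # Before reading any text, pattern[:i] must be deleted entirely: cost i.
    prev = list(range(m + 1))
    ends = [0] if prev[m] <= k else []
    for j, c in enumerate(text, start=1):
        # cur[0] = 0 because a fragment may start at any text position.
        cur = [0] * (m + 1)
        for i in range(1, m + 1):
            cur[i] = min(
                prev[i - 1] + (pattern[i - 1] != c),  # match or substitution
                prev[i] + 1,                          # extra text character
                cur[i - 1] + 1,                       # missing pattern character
            )
        if cur[m] <= k:
            ends.append(j)
        prev = cur
    return ends


print(k_error_end_positions("abc", "xabcx", 1))
```

With $P = \mathtt{abc}$, $T = \mathtt{xabcx}$, and $k = 1$, the sketch reports the end positions 3, 4, and 5, corresponding to the fragments $\mathtt{ab}$, $\mathtt{abc}$, and $\mathtt{abcx}$. The communication-complexity question in the abstract asks how few bits suffice to describe such an answer set without access to $P$ and $T$.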

Paper Structure (12 sections, 7 theorems, 8 equations, 5 figures)


Key Result

Proposition 0

Let $\mathcal{X} : P \leadsto$ ...

Figures (5)

  • Figure 1: The structure of occurrences of exact pattern matching is easy: either all exact occurrences of $P$ in $T$ form an arithmetic progression or there is just one such occurrence (which we may also view as a degenerate arithmetic progression). Depicted is a text $T$ and exact occurrences starting at the positions denoted above the text; we may assume that there is an occurrence that starts at position $0$ and that there is an occurrence that ends at position $|T|-1$.
  • Figure 2: Compare Figure 1: we fully understand the easy structure of exact pattern matching. In this figure, we reinterpret our knowledge in terms of the encoding scheme of Alice for Pattern Matching with Mismatches (in particular we show just the occurrences included in the set $S$) and showcase how the corresponding graph $\mathbf{G}_{S}$ and its black components evolve. We connect the same positions in $P$, as well as pairs of positions that are aligned by an occurrence of $P$ in $T$. As there are no mismatches, every such line implies that the connected characters are equal. For each connected component of the resulting graph (a black component), we know that all involved positions in $P$ and $T$ must have the same symbol. For illustrative purposes, we assume that $x = 3$ and we replace each character of a black component with a sentinel character (unique to that component), that is, we depict the strings $P^\#$ and $T^\#$.
  • Figure 3: Compared to Figure 2, we now have characters in $P$ and $T$ that mismatch. Again, we showcase how the corresponding graph $\mathbf{G}_{S}$ and its black components evolve; in the example, we allow for up to $k = 3$ mismatches. Again, for illustrative purposes, we assume that $x = 3$ and we replace each character of a black component with a sentinel character (unique to that component), that is, we depict the strings $P^\#$ and $T^\#$.
  • Figure 4: Compare Figures 2 and 3. In addition to mismatches, we now also allow character insertions or deletions. In the example, we depict an occurrence with at most $k = 4$ edits.
  • Figure 5: The LZ77 factorization of a string $X = \mathtt{abacabcabcaaaab}$ of length $n = 15$. The resulting encoding has $z=8$ elements: $(\mathtt{a}, 0)$, $(\mathtt{b}, 0)$, $(0,1)$, $(\mathtt{c}, 0)$, $(0, 2)$, $(3, 5)$, $(10, 3)$, $(8,1)$.
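
The factorization in the Figure 5 caption can be reproduced with a short Python sketch. It assumes the conventions suggested by the caption: a phrase is either a literal `(character, 0)` or a copy `(source position, length)`, copies may overlap the phrase being produced, and among equally long candidates the rightmost source is chosen. The tie-breaking rule and the names `lz77_factorize` and `lz77_decode` are assumptions for illustration; the quadratic search is deliberately naive.

```python
def lz77_factorize(x):
    """Greedy LZ77 factorization with self-overlapping copies (naive sketch)."""
    phrases = []
    i, n = 0, len(x)
    while i < n:
        best_len, best_src = 0, -1
        for j in range(i):  # candidate source positions strictly before i
            length = 0
            while i + length < n and x[j + length] == x[i + length]:
                length += 1
            # ">=" keeps the rightmost source among the longest matches
            if length > 0 and length >= best_len:
                best_len, best_src = length, j
        if best_len == 0:
            phrases.append((x[i], 0))  # fresh character: literal phrase
            i += 1
        else:
            phrases.append((best_src, best_len))  # copy phrase
            i += best_len
    return phrases


def lz77_decode(phrases):
    """Invert lz77_factorize; copying one character at a time handles overlaps."""
    out = []
    for first, length in phrases:
        if length == 0:
            out.append(first)
        else:
            for t in range(length):
                out.append(out[first + t])
    return "".join(out)


phrases = lz77_factorize("abacabcabcaaaab")
print(len(phrases), phrases)
print(lz77_decode(phrases))
```

On $X = \mathtt{abacabcabcaaaab}$ this yields the same $z = 8$ phrases as listed in the caption, and decoding them recovers $X$.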

Theorems & Definitions (9)

  • Proposition 0
  • Lemma 0: CKW20
  • Lemma 1: Periodicity Lemma FW65
  • Lemma 2
  • Definition 3: CKW22
  • Definition 5: Edit information
  • Lemma 6: CKW22
  • Lemma 7
  • Lemma 8: Properties of $\mathsf{self}\text{-}\delta_E$, CKW23