Faster two-dimensional pattern matching with $k$ mismatches

Jonas Ellert, Paweł Gawrychowski, Adam Górkiewicz, Tatiana Starikovskaya

TL;DR

This paper studies a natural generalisation of the approximate pattern matching problem to two-dimensional strings, which are simply square arrays of characters, and provides a new insight into two-dimensional periodicity to improve on the 30-year-old $\mathcal{O}(kn^{2})$ bound.

Abstract

The classical pattern matching problem asks for locating all occurrences of one string, called the pattern, in another, called the text, where a string is simply a sequence of characters. Due to the potential practical applications, it is desirable to seek approximate occurrences, for example by bounding the number of mismatches. This problem has been extensively studied, and by now we have a good understanding of the best possible time complexity as a function of $n$ (length of the text), $m$ (length of the pattern), and $k$ (number of mismatches). In particular, we know that for $k=\mathcal{O}(\sqrt{m})$, we can achieve quasi-linear time complexity [Gawrychowski and Uznański, ICALP 2018]. We consider a natural generalisation of the approximate pattern matching problem to two-dimensional strings, which are simply square arrays of characters. The exact version of this problem was extensively studied in the early 90s. While periodicity, which is the basic tool for one-dimensional pattern matching, admits a natural extension to two dimensions, it turns out to be significantly more challenging to work with, and it took some time until an alphabet-independent linear-time algorithm was obtained by Galil and Park [SICOMP 1996]. In approximate two-dimensional pattern matching, we are given a pattern of size $m\times m$ and a text of size $n\times n$, and ask for all locations in the text where the pattern matches with at most $k$ mismatches. The asymptotically fastest algorithm for this problem works in $\mathcal{O}(kn^{2})$ time [Amir and Landau, TCS 1991]. We provide a new insight into two-dimensional periodicity to improve on this 30-year-old bound. Our algorithm works in $\tilde{\mathcal{O}}((m^{2}+mk^{5/4})n^{2}/m^{2})$ time, which is $\tilde{\mathcal{O}}(n^{2})$ for $k=\mathcal{O}(m^{4/5})$.
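To make the problem statement concrete, here is a minimal brute-force sketch of two-dimensional pattern matching with $k$ mismatches. It is not the algorithm of the paper, nor that of Amir and Landau: it is a naive baseline that compares the pattern against every alignment and runs in $\mathcal{O}(n^{2}m^{2})$ time in the worst case. The function name `k_mismatch_2d_naive` and the representation of a two-dimensional string as a list of equal-length rows are assumptions made for illustration only.

```python
from typing import List, Sequence, Tuple


def k_mismatch_2d_naive(
    text: Sequence[Sequence[str]],
    pattern: Sequence[Sequence[str]],
    k: int,
) -> List[Tuple[int, int]]:
    """Return all top-left positions (i, j) at which the m x m pattern
    matches the n x n text with at most k mismatches.

    Hypothetical helper for illustration: a naive baseline, not the
    algorithm described in the paper.
    """
    n, m = len(text), len(pattern)
    occurrences = []
    for i in range(n - m + 1):
        for j in range(n - m + 1):
            mismatches = 0
            for r in range(m):
                text_row, pattern_row = text[i + r], pattern[r]
                for c in range(m):
                    if text_row[j + c] != pattern_row[c]:
                        mismatches += 1
                        if mismatches > k:
                            break  # this alignment already exceeds the budget
                if mismatches > k:
                    break
            if mismatches <= k:
                occurrences.append((i, j))
    return occurrences


if __name__ == "__main__":
    text = ["abab", "baba", "abab", "baba"]
    pattern = ["ab", "ba"]
    print(k_mismatch_2d_naive(text, pattern, 0))
```

On this checkerboard text every alignment is either an exact occurrence or disagrees with the pattern in all four cells, so raising `k` from 0 to 1 reports the same positions, while `k = 4` reports every alignment.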

Paper Structure

This paper contains 21 sections, 31 theorems, 56 equations, 5 figures.

Key Result

Theorem 1.1

Given a two-dimensional $m \times m$ pattern string $P$ and a two-dimensional $n \times n$ text string $T$ with $m \le n$, there is an algorithm that solves the $k$-mismatch problem in $\tilde{\mathcal{O}}((m^2 + mk^{5/4})n^2 / m^2)$ time.
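The abstract's claim that this bound is $\tilde{\mathcal{O}}(n^{2})$ for $k=\mathcal{O}(m^{4/5})$ can be checked with a short calculation. The constant $c$ below is introduced only for this sketch and does not appear in the source. Expanding the bound gives

$$\frac{(m^{2}+mk^{5/4})\,n^{2}}{m^{2}} \;=\; n^{2}+\frac{k^{5/4}}{m}\,n^{2}.$$

Assuming $k\le c\,m^{4/5}$ for some constant $c>0$, we have $k^{5/4}\le c^{5/4}\,m$, and therefore

$$n^{2}+\frac{k^{5/4}}{m}\,n^{2}\;\le\;\bigl(1+c^{5/4}\bigr)\,n^{2}\;=\;\mathcal{O}(n^{2}),$$

so, with the polylogarithmic factors hidden by $\tilde{\mathcal{O}}$, the running time is $\tilde{\mathcal{O}}(n^{2})$ in this regime.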

Figures (5)

  • Figure 1: All the points in the polygon form a truncated tile and the thicker points form a truncated subtile.
  • Figure 2: Parallelogram grid.
  • Figure 3: Parallelogram cover of $T_\mathbf{a}$. The polygon in the center corresponds to $T_\mathbf{a}$. Red parallelograms are $\mathcal{O}(n/\ell)$-peripheral, blue coverable, and green belong to $\mathcal{I}$.
  • Figure 4: Partitioning of $F$.
  • Figure 5: Pattern partitioning.

Theorems & Definitions (93)

  • Theorem 1.1
  • Definition 2.1
  • Definition 2.2: Hamming distance
  • Definition 2.3: Shifting
  • Theorem 2.1
  • proof
  • Corollary 2.1: of Karloff1993
  • proof
  • Claim 2.1
  • proof
  • ...and 83 more