Table of Contents
Fetching ...

Pattern Matching with Mismatches and Wildcards

Gabriel Bathie, Panagiotis Charalampopoulos, Tatiana Starikovskaya

TL;DR

This work tackles approximate pattern matching with wildcards under the Hamming distance. It introduces the PILLAR model to unify algorithm design across standard, compressed, dynamic, and quantum settings, deriving a main algorithm that runs in $O\big(n+(D+k)(G+k)\cdot n/m\big)$. A key contribution is a fine-grained structural decomposition that yields efficient representations of $k$-mismatch occurrences as $\mathcal{O}((D+k)G)$ arithmetic progressions plus $\mathcal{O}((D+k)k)$ extra occurrences, alongside a simpler exact-pattern approach when $k=0$. The paper also proves a tight lower bound on the number of progressions via Behrend-style progression-free constructions, and demonstrates the broad applicability of the results across multiple computational settings with practical parameter regimes in mind.

Abstract

In this work, we address the problem of approximate pattern matching with wildcards. Given a pattern $P$ of length $m$ containing $D$ wildcards, a text $T$ of length $n$, and an integer $k$, our objective is to identify all fragments of $T$ within Hamming distance $k$ from $P$. Our primary contribution is an algorithm with runtime $O(n+(D+k)(G+k)\cdot n/m)$ for this problem. Here, $G \le D$ represents the number of maximal wildcard fragments in $P$. We derive this algorithm by elaborating in a non-trivial way on the ideas presented by [Charalampopoulos et al., FOCS'20] for pattern matching with mismatches (without wildcards). Our algorithm improves over the state of the art when $D$, $G$, and $k$ are small relative to $n$. For instance, if $m = n/2$, $k=G=n^{2/5}$, and $D=n^{3/5}$, our algorithm operates in $O(n)$ time, surpassing the $Ω(n^{6/5})$ time requirement of all previously known algorithms. In the case of exact pattern matching with wildcards ($k=0$), we present a much simpler algorithm with runtime $O(n+DG\cdot n/m)$ that clearly illustrates our main technical innovation: the utilisation of positions of $P$ that do not belong to any fragment of $P$ with a density of wildcards much larger than $D/m$ as anchors for the sought (approximate) occurrences. Notably, our algorithm outperforms the best-known $O(n\log m)$-time FFT-based algorithms of [Cole and Hariharan, STOC'02] and [Clifford and Clifford, IPL'04] if $DG = o(m\log m)$. We complement our algorithmic results with a structural characterization of the $k$-mismatch occurrences of $P$. We demonstrate that in a text of length $O(m)$, these occurrences can be partitioned into $O((D+k)(G+k))$ arithmetic progressions. Additionally, we construct an infinite family of examples with $Ω((D+k)k)$ arithmetic progressions of occurrences, leveraging a combinatorial result on progression-free sets [Elkin, SODA'10].

Pattern Matching with Mismatches and Wildcards

TL;DR

This work tackles approximate pattern matching with wildcards under the Hamming distance. It introduces the PILLAR model to unify algorithm design across standard, compressed, dynamic, and quantum settings, deriving a main algorithm that runs in . A key contribution is a fine-grained structural decomposition that yields efficient representations of -mismatch occurrences as arithmetic progressions plus extra occurrences, alongside a simpler exact-pattern approach when . The paper also proves a tight lower bound on the number of progressions via Behrend-style progression-free constructions, and demonstrates the broad applicability of the results across multiple computational settings with practical parameter regimes in mind.

Abstract

In this work, we address the problem of approximate pattern matching with wildcards. Given a pattern of length containing wildcards, a text of length , and an integer , our objective is to identify all fragments of within Hamming distance from . Our primary contribution is an algorithm with runtime for this problem. Here, represents the number of maximal wildcard fragments in . We derive this algorithm by elaborating in a non-trivial way on the ideas presented by [Charalampopoulos et al., FOCS'20] for pattern matching with mismatches (without wildcards). Our algorithm improves over the state of the art when , , and are small relative to . For instance, if , , and , our algorithm operates in time, surpassing the time requirement of all previously known algorithms. In the case of exact pattern matching with wildcards (), we present a much simpler algorithm with runtime that clearly illustrates our main technical innovation: the utilisation of positions of that do not belong to any fragment of with a density of wildcards much larger than as anchors for the sought (approximate) occurrences. Notably, our algorithm outperforms the best-known -time FFT-based algorithms of [Cole and Hariharan, STOC'02] and [Clifford and Clifford, IPL'04] if . We complement our algorithmic results with a structural characterization of the -mismatch occurrences of . We demonstrate that in a text of length , these occurrences can be partitioned into arithmetic progressions. Additionally, we construct an infinite family of examples with arithmetic progressions of occurrences, leveraging a combinatorial result on progression-free sets [Elkin, SODA'10].
Paper Structure (20 sections, 21 theorems, 1 table)

This paper contains 20 sections, 21 theorems, 1 table.

Key Result

Theorem 1

Let $S$ and $T$ be solid strings of respective lengths $m$ and $n \le 3m/2$. We can compute a representation of the $d$-mismatch occurrences of $S$ in $T$ using $\mathcal{O}(d^2\log\log d)$ time plus $\mathcal{O}(d^2)$PILLAR operations.

Theorems & Definitions (27)

  • Theorem 1: unified
  • Theorem 2
  • Corollary 3: folklore
  • Definition 6: Sparsifiers
  • Lemma 7
  • Corollary 8
  • Definition 9: Misperiods
  • Example 10
  • Lemma 10
  • Corollary 11
  • ...and 17 more