Pattern Matching with Mismatches and Wildcards
Gabriel Bathie, Panagiotis Charalampopoulos, Tatiana Starikovskaya
TL;DR
This work tackles approximate pattern matching with wildcards under the Hamming distance. It introduces the PILLAR model to unify algorithm design across standard, compressed, dynamic, and quantum settings, deriving a main algorithm that runs in $O\big(n+(D+k)(G+k)\cdot n/m\big)$. A key contribution is a fine-grained structural decomposition that yields efficient representations of $k$-mismatch occurrences as $\mathcal{O}((D+k)G)$ arithmetic progressions plus $\mathcal{O}((D+k)k)$ extra occurrences, alongside a simpler exact-pattern approach when $k=0$. The paper also proves a tight lower bound on the number of progressions via Behrend-style progression-free constructions, and demonstrates the broad applicability of the results across multiple computational settings with practical parameter regimes in mind.
Abstract
In this work, we address the problem of approximate pattern matching with wildcards. Given a pattern $P$ of length $m$ containing $D$ wildcards, a text $T$ of length $n$, and an integer $k$, our objective is to identify all fragments of $T$ within Hamming distance $k$ from $P$. Our primary contribution is an algorithm with runtime $O(n+(D+k)(G+k)\cdot n/m)$ for this problem. Here, $G \le D$ represents the number of maximal wildcard fragments in $P$. We derive this algorithm by elaborating in a non-trivial way on the ideas presented by [Charalampopoulos et al., FOCS'20] for pattern matching with mismatches (without wildcards). Our algorithm improves over the state of the art when $D$, $G$, and $k$ are small relative to $n$. For instance, if $m = n/2$, $k=G=n^{2/5}$, and $D=n^{3/5}$, our algorithm operates in $O(n)$ time, surpassing the $Ω(n^{6/5})$ time requirement of all previously known algorithms. In the case of exact pattern matching with wildcards ($k=0$), we present a much simpler algorithm with runtime $O(n+DG\cdot n/m)$ that clearly illustrates our main technical innovation: the utilisation of positions of $P$ that do not belong to any fragment of $P$ with a density of wildcards much larger than $D/m$ as anchors for the sought (approximate) occurrences. Notably, our algorithm outperforms the best-known $O(n\log m)$-time FFT-based algorithms of [Cole and Hariharan, STOC'02] and [Clifford and Clifford, IPL'04] if $DG = o(m\log m)$. We complement our algorithmic results with a structural characterization of the $k$-mismatch occurrences of $P$. We demonstrate that in a text of length $O(m)$, these occurrences can be partitioned into $O((D+k)(G+k))$ arithmetic progressions. Additionally, we construct an infinite family of examples with $Ω((D+k)k)$ arithmetic progressions of occurrences, leveraging a combinatorial result on progression-free sets [Elkin, SODA'10].
