Pattern Matching with Mismatches and Wildcards

Gabriel Bathie; Panagiotis Charalampopoulos; Tatiana Starikovskaya

TL;DR

This work tackles approximate pattern matching with wildcards under the Hamming distance. Its algorithms are designed in the PILLAR model, so that a single algorithm applies across the standard, compressed, dynamic, and quantum settings. The main algorithm runs in $O(n+(D+k)(G+k)\cdot n/m)$ time, where $D$ is the number of wildcards in the pattern, $G \le D$ is the number of maximal wildcard fragments, and $k$ is the mismatch budget. A key contribution is a structural decomposition that represents the $k$-mismatch occurrences as $O((D+k)G)$ arithmetic progressions plus $O((D+k)k)$ extra occurrences, alongside a simpler algorithm for the exact case $k=0$. The paper also proves a lower bound of $\Omega((D+k)k)$ on the number of progressions, using a construction based on progression-free sets.

Abstract

In this work, we address the problem of approximate pattern matching with wildcards. Given a pattern $P$ of length $m$ containing $D$ wildcards, a text $T$ of length $n$, and an integer $k$, our objective is to identify all fragments of $T$ within Hamming distance $k$ from $P$. Our primary contribution is an algorithm with runtime $O(n+(D+k)(G+k)\cdot n/m)$ for this problem. Here, $G \le D$ represents the number of maximal wildcard fragments in $P$. We derive this algorithm by elaborating in a non-trivial way on the ideas presented by [Charalampopoulos et al., FOCS'20] for pattern matching with mismatches (without wildcards). Our algorithm improves over the state of the art when $D$, $G$, and $k$ are small relative to $n$. For instance, if $m = n/2$, $k=G=n^{2/5}$, and $D=n^{3/5}$, our algorithm operates in $O(n)$ time, surpassing the $\Omega(n^{6/5})$ time requirement of all previously known algorithms. In the case of exact pattern matching with wildcards ($k=0$), we present a much simpler algorithm with runtime $O(n+DG\cdot n/m)$ that clearly illustrates our main technical innovation: the utilisation of positions of $P$ that do not belong to any fragment of $P$ with a density of wildcards much larger than $D/m$ as anchors for the sought (approximate) occurrences. Notably, our algorithm outperforms the best-known $O(n\log m)$-time FFT-based algorithms of [Cole and Hariharan, STOC'02] and [Clifford and Clifford, IPL'04] if $DG = o(m\log m)$. We complement our algorithmic results with a structural characterization of the $k$-mismatch occurrences of $P$. We demonstrate that in a text of length $O(m)$, these occurrences can be partitioned into $O((D+k)(G+k))$ arithmetic progressions. Additionally, we construct an infinite family of examples with $\Omega((D+k)k)$ arithmetic progressions of occurrences, leveraging a combinatorial result on progression-free sets [Elkin, SODA'10].
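To make the problem statement and the parameters $D$, $G$, and $k$ concrete, the following is a minimal brute-force sketch in Python. It is not the algorithm of the paper: it runs in $O(nm)$ time and serves only as a reference definition of a $k$-mismatch occurrence. The wildcard symbol `?` and all function names are assumptions made for illustration.

```python
from typing import List

WILDCARD = "?"  # assumed wildcard symbol; the paper does not fix a concrete character


def count_wildcards(pattern: str) -> int:
    """D: number of wildcard positions in the pattern."""
    return pattern.count(WILDCARD)


def count_wildcard_groups(pattern: str) -> int:
    """G: number of maximal runs of consecutive wildcards (hence G <= D)."""
    groups = 0
    prev_is_wild = False
    for ch in pattern:
        is_wild = ch == WILDCARD
        if is_wild and not prev_is_wild:
            groups += 1  # a new maximal wildcard fragment starts here
        prev_is_wild = is_wild
    return groups


def hamming_with_wildcards(pattern: str, window: str) -> int:
    """Count positions where a non-wildcard pattern symbol differs from the text."""
    return sum(1 for p, t in zip(pattern, window) if p != WILDCARD and p != t)


def k_mismatch_occurrences(pattern: str, text: str, k: int) -> List[int]:
    """Brute force: all start positions i with distance(P, T[i:i+m]) <= k."""
    m, n = len(pattern), len(text)
    return [
        i
        for i in range(n - m + 1)
        if hamming_with_wildcards(pattern, text[i:i + m]) <= k
    ]
```

For example, the pattern `a??b?c` has $D=3$ wildcards in $G=2$ maximal fragments, and `k_mismatch_occurrences("a?b", "aabacbab", 1)` reports every start position whose window differs from the pattern in at most one non-wildcard position.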

Paper Structure (20 sections, 21 theorems, 1 table)

This paper contains 20 sections, 21 theorems, 1 table.

Introduction
Multi-framework algorithms with the PILLAR model.
The standard trick.
Reduction to pattern matching with mismatches.
Our results.
Technical overview.
Organisation of the paper.
Preliminaries
The PILLAR model.
Exact Pattern Matching in the PILLAR Model
Pattern Matching with k Mismatches in the PILLAR Model
Computing Structure in the Pattern
The Almost Periodic Case
The Remaining Cases
Proof of the algorithmic part of \ref{thm:ham-pm}.
...and 5 more sections

Key Result

Theorem 1

Let $S$ and $T$ be solid strings of respective lengths $m$ and $n \le 3m/2$. We can compute a representation of the $d$-mismatch occurrences of $S$ in $T$ using $\mathcal{O}(d^2\log\log d)$ time plus $\mathcal{O}(d^2)$ PILLAR operations.
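The theorem speaks of a representation of the occurrences rather than an explicit list. The abstract describes occurrences grouped into arithmetic progressions, and the sketch below illustrates what such a compact encoding can look like. It is an assumption-laden illustration, not the paper's construction: the greedy splitting and the `(start, difference, count)` triple format are hypothetical choices, and the greedy split does not necessarily minimise the number of progressions. In the paper the progressions come from structural analysis, not from post-processing a list.

```python
from typing import List, Tuple

Progression = Tuple[int, int, int]  # (start, difference, count); hypothetical format


def to_progressions(positions: List[int]) -> List[Progression]:
    """Greedily split sorted, distinct positions into arithmetic progressions."""
    result: List[Progression] = []
    i, n = 0, len(positions)
    while i < n:
        if i + 1 == n:
            # A single leftover position forms a trivial progression.
            result.append((positions[i], 0, 1))
            break
        diff = positions[i + 1] - positions[i]
        j = i + 1
        # Extend the run while consecutive gaps keep the same difference.
        while j + 1 < n and positions[j + 1] - positions[j] == diff:
            j += 1
        result.append((positions[i], diff, j - i + 1))
        i = j + 1
    return result


def expand(progressions: List[Progression]) -> List[int]:
    """Inverse of to_progressions: list every position explicitly."""
    return [start + t * diff for start, diff, count in progressions for t in range(count)]
```

A long run of periodic occurrences is stored as a single triple, so the size of the representation depends on the number of progressions and not on the number of occurrences.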

Theorems & Definitions (27)

Theorem 1: unified
Theorem 2
Corollary 3: folklore
Definition 6: Sparsifiers
Lemma 7
Corollary 8
Definition 9: Misperiods
Example 10
Lemma 10
Corollary 11
...and 17 more
