Algorithms for Parameterized String Matching with Mismatches

Apurba Saha, Iftekhar Hakim Kaowsar, Mahdi Hasnat Siyam, M. Sohel Rahman

TL;DR

This work tackles parameterized string matching with mismatches by presenting two independent approaches: a deterministic algorithm for general mismatch tolerance that uses FFT-based symbol-weight computations and per-alignment maximum weighted bipartite matchings to bound mismatches, achieving a time bound of $O(|t| \cdot |\Sigma|^2 \sqrt{|\Sigma|} \log(|t| \cdot |\Sigma|))$; and a probabilistic hashing-based algorithm for the single-mismatch case that runs in $O(|t| \log |t|)$ time, with collision probabilities analyzed and mitigated via double hashing. The deterministic method encodes parameterized strings, reduces the problem to static matching via a sequence of convolutions, and supports parallelization to accelerate computation. The single-mismatch approach uses polynomial hashing and a segment tree to locate the first mismatch efficiently, improving to $O(|t| \log |t|)$ by descending the tree instead of binary search, with empirical collision analysis supporting practical deployment. Overall, the paper advances fast parameterized matching for general mismatch tolerance and provides a practical, faster hashing-based solution for the single-mismatch case, with clear paths to parallelization and future refinements.
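The hashing idea mentioned above can be illustrated on plain (non-parameterized) strings. The following is only a minimal sketch under stated assumptions: it uses prefix polynomial hashes with a binary search, not the segment-tree descent over parameterized encodings that the paper describes, and the names `MOD`, `BASE`, `prefix_hashes`, and `first_mismatch` are hypothetical.

```python
# Illustrative sketch only: plain-string polynomial hashing, not the paper's method.
MOD = (1 << 61) - 1      # a Mersenne prime modulus (assumed choice)
BASE = 911382323         # an arbitrary fixed base (assumed choice)

def prefix_hashes(s):
    """h[i] is the polynomial hash of the prefix s[:i]."""
    h = [0] * (len(s) + 1)
    for i, ch in enumerate(s):
        h[i + 1] = (h[i] * BASE + ord(ch) + 1) % MOD
    return h

def first_mismatch(a, b):
    """Return the first index where equal-length strings a and b differ,
    or len(a) if their hashes agree everywhere (correct up to collisions)."""
    assert len(a) == len(b)
    ha, hb = prefix_hashes(a), prefix_hashes(b)
    lo, hi = 0, len(a)   # invariant: prefixes of length lo hash equally
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if ha[mid] == hb[mid]:
            lo = mid         # prefixes of length mid agree; look further right
        else:
            hi = mid - 1     # a mismatch lies within the first mid characters
    return lo
```

Each comparison costs $O(1)$ after linear preprocessing, so locating the first mismatch takes $O(\log n)$ hash comparisons. A second, independent hash could be compared alongside the first to reduce the collision probability, in the spirit of the double hashing mentioned above.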

Abstract

Two strings are said to parameterized match when there exists a bijection of the parameterized alphabet onto itself that transforms one string into the other. Parameterized matching has applications in software duplication detection, image processing, and computational biology. We consider the problem in which a pattern $p$, a text $t$, and a mismatch tolerance limit $k$ are given, and the goal is to find all positions in $t$ at which $p$ parameterized matches the substring of $t$ of length $|p|$ with at most $k$ mismatches. Our main result is an algorithm for this problem with $O(\alpha^2 n \log n + n \alpha^2 \sqrt{\alpha} \log(n \alpha))$ time complexity, where $n = |t|$ and $\alpha = |\Sigma|$, which improves on the algorithm by Hazay, Lewenstein and Sokol for $k = \tilde{\Omega}(|\Sigma|^{5/3})$. We also present a hashing-based probabilistic algorithm for this problem when $k = 1$ with $O(n \log n)$ time complexity, which we believe is algorithmically beautiful.
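To make the definition concrete, here is a minimal sketch of an exact (zero-mismatch) parameterized match test between two equal-length strings. The function name `parameterized_match` is hypothetical, and the sketch assumes every symbol belongs to the parameterized alphabet. It relies on the fact that a consistent one-to-one mapping on the symbols actually used can always be extended to a bijection of a finite alphabet onto itself.

```python
def parameterized_match(x, y):
    """Return True if some bijection on the alphabet maps x onto y."""
    if len(x) != len(y):
        return False
    forward, backward = {}, {}   # symbol of x -> symbol of y, and the inverse
    for a, b in zip(x, y):
        # setdefault records the pairing on first sight and returns the
        # previously recorded partner otherwise; any disagreement means
        # the mapping is not a consistent one-to-one renaming.
        if forward.setdefault(a, b) != b or backward.setdefault(b, a) != a:
            return False
    return True
```

For instance, `parameterized_match("abab", "cdcd")` holds under the renaming a→c, b→d, whereas `"ab"` and `"aa"` do not match because two distinct symbols would have to map to the same one.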

Paper Structure

This paper contains 11 sections, 3 theorems, 8 equations, 2 figures, 2 tables, 4 algorithms.

Key Result

Lemma 3.1

If one setting starting at $i_1$ requires some positions of the text to be discarded, another setting starting at $i_2$ may not require discarding those same positions in order to obtain the least number of mismatches. This lemma also applies to patterns.

Figures (2)

  • Figure 1: (a) Text $t=abcbbbaaaca$ and pattern $p=deeeef$ aligned at the 1st position. Here $k=2$, and one possible matching is shown; the 3rd and 6th positions are discarded. (b) The same pattern aligned at the 6th position. This time the 5th and 6th positions are discarded to obtain a parameterized match.
  • Figure 2: Example of constructing the matching graph with $x=abcaaeebbcd$ and $y=adbeeaaddac$. The weight of an edge $(u,v)$ is the number of positions where $u$ and $v$ are aligned. Matched edges are thickened.
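The matching-graph construction in the captions can be sketched as follows. This is an illustrative brute force under stated assumptions, not the paper's algorithm: it enumerates all symbol pairings with `itertools.permutations`, which is only feasible for tiny alphabets, whereas the paper uses a maximum weighted bipartite matching per alignment. The name `min_mismatches` is hypothetical.

```python
from collections import Counter
from itertools import permutations

def min_mismatches(x, y):
    """Fewest positions to discard so the rest of x and y parameterized match."""
    assert len(x) == len(y)
    # Edge weight (u, v): number of positions where u in x is aligned with v in y.
    weight = Counter(zip(x, y))
    left, right = sorted(set(x)), sorted(set(y))
    # Pad the smaller side with None so every pairing is a full permutation;
    # pairs involving None have weight 0 (Counter returns 0 for missing keys).
    size = max(len(left), len(right))
    left += [None] * (size - len(left))
    right += [None] * (size - len(right))
    # Brute-force maximum weight matching over all one-to-one pairings.
    best = max(
        sum(weight[(u, v)] for u, v in zip(left, perm))
        for perm in permutations(right)
    )
    return len(x) - best

t, p = "abcbbbaaaca", "deeeef"
print(min_mismatches(t[0:6], p))                      # Figure 1(a): prints 2
print(min_mismatches(t[5:11], p))                     # Figure 1(b): prints 2
print(min_mismatches("abcaaeebbcd", "adbeeaaddac"))   # Figure 2: prints 2
```

In the Figure 2 example the heaviest pairing is a→e, b→d, c→b, d→c, e→a with total weight 9, so 11 − 9 = 2 positions remain unmatched.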

Theorems & Definitions (3)

  • Lemma 3.1
  • Lemma 3.2
  • Lemma 3.3