Faster ED-String Matching with $k$ Mismatches

Paweł Gawrychowski, Adam Górkiewicz, Pola Marciniak, Solon P. Pissis, Karol Pokorski

TL;DR

This work addresses pattern matching with $k$ mismatches in elastic-degenerate strings (ED-strings), a compact representation of many similar sequences such as pangenomes. The authors achieve a faster $\tilde{\mathcal{O}}(nm^{1.5}+N)$ time algorithm for constant $k$ by leveraging the structural characterization of Charalampopoulos et al. together with fast Fourier transform, and they extend the approach to handle multiple patterns simultaneously through refined multi-pattern analysis. The key technical contributions include a reduction to Active Prefixes Extension (APE) with $k$ mismatches, a split analysis into very short/short/long pattern regimes, and a novel refinement that groups occurrence progressions into a small number of classes for efficient FFT-based processing. The results significantly improve the previously best-known $m^2$-dependent bounds for constant $k$, advancing practical approximate matching in large ED-text collections and offering techniques that may be of independent interest for multi-pattern string matching with structural regularities.

Abstract

We revisit the complexity of approximate pattern matching in an elastic-degenerate string. Such a string is a sequence of $n$ finite sets of strings of total length $N$, and compactly describes a collection of strings obtained by first choosing exactly one string in every set, and then concatenating them together. This is motivated by the need to store a collection of highly similar DNA sequences. The basic algorithmic question on elastic-degenerate strings is pattern matching: given such an elastic-degenerate string and a standard pattern of length $m$, check if the pattern occurs in one of the strings in the described collection. Bernardini et al. [SICOMP 2022] showed how to leverage fast matrix multiplication to obtain an $\tilde{\mathcal{O}}(nm^{\omega-1})+\mathcal{O}(N)$ time complexity for this problem, where $\omega$ is the matrix multiplication exponent. However, the best result so far for finding occurrences with $k$ mismatches, where $k$ is a constant, is the $\tilde{\mathcal{O}}(nm^{2}+N)$-time algorithm of Pissis et al. [CPM 2025]. This raises the question of whether increasing the dependency on $m$ from $m^{\omega-1}$ to quadratic is necessary when moving from $k=0$ to larger (but still constant) $k$. We design an $\tilde{\mathcal{O}}(nm^{1.5}+N)$-time algorithm for pattern matching with $k$ mismatches in an elastic-degenerate string, for any constant $k$. To obtain this time bound, we leverage the structural characterization of occurrences with $k$ mismatches of Charalampopoulos et al. [FOCS 2020] together with fast Fourier transform. We need to work with multiple patterns at the same time, instead of a single pattern, which requires refining the original characterization. This might be of independent interest.
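
To make the problem statement concrete, the following is a minimal brute-force sketch of the definition only, not the algorithm of the paper. It assumes an ED-string is given as a list of lists of strings (the empty string is allowed), and the names `expand`, `hamming`, and `occurs_with_k_mismatches` are hypothetical helpers introduced here for illustration. It enumerates every string in the described collection, so its running time is exponential in $n$ and it is only suitable for toy inputs.

```python
from itertools import product


def expand(ed_string):
    """Yield every string described by the ED-string.

    Exactly one string is chosen from every set and the choices are
    concatenated. Exponential in the number of sets: toy inputs only.
    """
    for choice in product(*ed_string):
        yield "".join(choice)


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def occurs_with_k_mismatches(ed_string, pattern, k):
    """Return True if the pattern occurs with at most k mismatches
    as a substring of some string in the described collection."""
    m = len(pattern)
    for text in expand(ed_string):
        for i in range(len(text) - m + 1):
            if hamming(text[i:i + m], pattern) <= k:
                return True
    return False
```

For instance, under this representation `[["A"], ["C", "GT", ""], ["TA"]]` describes the three strings `ACTA`, `AGTTA`, and `ATA`; the pattern `GTA` has no exact occurrence in any of them, but it does occur with one mismatch (against `CTA`).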

Paper Structure

This paper contains 31 sections, 22 theorems, 36 equations, 3 figures, 1 table.

Key Result

Theorem 1.1

Given a pattern $P$ of length $m$ and an ED-string $\widetilde{T}$ of length $n$ and size $N$, EDSM with $k$ Mismatches, for $k = \mathcal{O}(1)$, can be solved in $\mathcal{O}(nm^{1.5}\mathop{\mathrm{polylog}}\nolimits m + N)$ time.
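
As background on how fast Fourier transform typically enters mismatch counting, here is a hedged sketch of the classical convolution technique for a single standard text and a single pattern: for each character, the indicator vector of the text is convolved with the reversed indicator vector of the pattern, and summing over characters gives the number of matching positions at every alignment. This is a generic building block under stated assumptions, not the procedure behind Theorem 1.1. The functions `fft`, `convolve`, and `mismatch_counts` are hypothetical names, and the FFT is a simple recursive pure-Python version chosen for self-containment rather than speed.

```python
import cmath


def fft(a, invert=False):
    """Recursive radix-2 FFT; len(a) must be a power of two.
    The inverse transform is left unnormalised (caller divides by n)."""
    n = len(a)
    if n == 1:
        return a[:]
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = -1 if invert else 1
    out = [0j] * n
    for j in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * j / n)
        out[j] = even[j] + w * odd[j]
        out[j + n // 2] = even[j] - w * odd[j]
    return out


def convolve(x, y):
    """Linear convolution of two non-empty integer sequences via FFT."""
    size = 1
    while size < len(x) + len(y) - 1:
        size *= 2
    fx = fft([complex(v) for v in x] + [0j] * (size - len(x)))
    fy = fft([complex(v) for v in y] + [0j] * (size - len(y)))
    prod = fft([p * q for p, q in zip(fx, fy)], invert=True)
    return [round(v.real / size) for v in prod[:len(x) + len(y) - 1]]


def mismatch_counts(text, pattern):
    """Hamming distance between the pattern and every length-m window
    of the text, using one convolution per distinct pattern character."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    matches = [0] * (n - m + 1)
    for c in set(pattern):
        t_ind = [1 if ch == c else 0 for ch in text]
        p_ind = [1 if ch == c else 0 for ch in reversed(pattern)]
        conv = convolve(t_ind, p_ind)
        for i in range(n - m + 1):
            # conv[i + m - 1] counts positions where both sides equal c
            matches[i] += conv[i + m - 1]
    return [m - x for x in matches]
```

An alignment is an occurrence with $k$ mismatches exactly when its entry in the returned list is at most $k$. With one convolution per character, this sketch costs roughly $\mathcal{O}(\sigma\, n \log n)$ arithmetic operations for an alphabet of size $\sigma$.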

Figures (3)

  • Figure 1: MSA of three strings and the corresponding (non-unique) ED-string $\widetilde{T}$.
  • Figure 2: An example of EDSM with $k$ Mismatches for $k=1$.
  • Figure 3: Regions $R_{0},R_{1},\dots,R_{d}$. Black rectangles denote the elements of $\operatorname{Mis}(T, \bar{T})$.

Theorems & Definitions (33)

  • Theorem 1.1
  • Theorem 1.2
  • Lemma 1: suffix tree (Farach) with LCA queries (Harel and Tarjan, 1984)
  • Lemma 2
  • Lemma 3: Farach
  • Lemma 4
  • Theorem 4.1
  • proof
  • Lemma 5
  • proof
  • ...and 23 more