Faster ED-String Matching with $k$ Mismatches
Paweł Gawrychowski, Adam Górkiewicz, Pola Marciniak, Solon P. Pissis, Karol Pokorski
TL;DR
This work addresses pattern matching with $k$ mismatches in elastic-degenerate strings (ED-strings), a compact representation of many similar sequences such as pangenomes. The authors achieve a faster $ ilde{O}(nm^{1.5}+N)$ time algorithm for constant $k$ by leveraging the structural characterization of Charalampopoulos et al. together with fast Fourier transform, and they extend the approach to handle multiple patterns simultaneously through refined multi-pattern analysis. The key technical contributions include a reduction to Active Prefixes Extension (APE) with $k$ mismatches, a split analysis into very short/short/long pattern regimes, and a novel refinement that groups occurrence progressions into a small number of classes for efficient FFT-based processing. The results significantly improve the previously best-known $m^2$-dependent bounds for constant $k$, advancing practical approximate matching in large ED-text collections and offering techniques that may be of independent interest for multi-pattern string matching with structural regularities.
Abstract
We revisit the complexity of approximate pattern matching in an elastic-degenerate string. Such a string is a sequence of $n$ finite sets of strings of total length $N$, and compactly describes a collection of strings obtained by first choosing exactly one string in every set, and then concatenating them together. This is motivated by the need of storing a collection of highly similar DNA sequences. The basic algorithmic question on elastic-degenerate strings is pattern matching: given such an elastic-degenerate string and a standard pattern of length $m$, check if the pattern occurs in one of the strings in the described collection. Bernardini et al.~[SICOMP 2022] showed how to leverage fast matrix multiplication to obtain an $\tilde{\mathcal{O}}(nm^{ω-1})+\mathcal{O}(N)$-time complexity for this problem, where $w$ is the matrix multiplication exponent. However, the best result so far for finding occurrences with $k$ mismatches, where $k$ is a constant, is the $\tilde{\mathcal{O}}(nm^{2}+N)$-time algorithm of Pissis et al.~[CPM 2025]. This brings the question whether increasing the dependency on $m$ from $m^{ω-1}$ to quadratic is necessary when moving from $k=0$ to larger (but still constant) $k$. We design an $\tilde{\mathcal{O}}(nm^{1.5}+N)$-time algorithm for pattern matching with $k$ mismatches in an elastic-degenerate string, for any constant $k$. To obtain this time bound, we leverage the structural characterization of occurrences with $k$ mismatches of Charalampopoulos et al.~[FOCS 2020] together with fast Fourier transform. We need to work with multiple patterns at the same time, instead of a single pattern, which requires refining the original characterization. This might be of independent interest.
