Faster Algorithms for Text-to-Pattern Hamming Distances
Timothy M. Chan, Ce Jin, Virginia Vassilevska Williams, Yinzhan Xu
TL;DR
The paper gives faster exact and approximate algorithms for Text-to-Pattern Hamming Distances (TT-Hamming) in the Word-RAM model. For the exact problem, it combines hashing, bit-packed FFT, and recursion to obtain an $O(n\sqrt{m})$ Las Vegas algorithm and a deterministic algorithm running in $O(n\sqrt{m}(\log m\log\log m)^{1/4})$ time; the randomized algorithm extends to the $k$-bounded ($k$-mismatch) setting in $O\big(n+\frac{nk}{\sqrt{m}}\big)$ time. For the $(1+\varepsilon)$-approximate version, it combines Fredman's trick, equality matrix product, and random sampling to reach $\tilde{O}(\varepsilon^{-0.93}n)$ time, a polynomial improvement in $1/\varepsilon$ over the previous $\tilde{O}(\varepsilon^{-1}n)$ bound. A key technical contribution is the X+Y Lemma, which computes the sumset of two integer sets faster when the universe size is close to quadratic in the number of elements. The paper also proves a fine-grained equivalence between exact TT-Hamming and a range-restricted, counting version of 3SUM, and its techniques apply to several problem variants, including $k$-mismatch and dominance matching. Overall, the work pairs algorithmic progress with fine-grained complexity insights that may carry over to related convolution and sumset problems and to other norms.
Abstract
We study the classic Text-to-Pattern Hamming Distances problem: given a pattern $P$ of length $m$ and a text $T$ of length $n$, both over a polynomial-size alphabet, compute the Hamming distance between $P$ and $T[i\,..\,i+m-1]$ for every shift $i$, under the standard Word-RAM model with $\Theta(\log n)$-bit words.

- We provide an $O(n\sqrt{m})$ time Las Vegas randomized algorithm for this problem, beating the decades-old $O(n \sqrt{m \log m})$ running time [Abrahamson, SICOMP 1987]. We also obtain a deterministic algorithm, with a slightly higher $O(n\sqrt{m}(\log m\log\log m)^{1/4})$ running time. Our randomized algorithm extends to the $k$-bounded setting, with running time $O\big(n+\frac{nk}{\sqrt{m}}\big)$, removing all the extra logarithmic factors from earlier algorithms [Gawrychowski and Uznański, ICALP 2018; Chan, Golan, Kociumaka, Kopelowitz and Porat, STOC 2020].
- For the $(1+\varepsilon)$-approximate version of Text-to-Pattern Hamming Distances, we give an $\tilde{O}(\varepsilon^{-0.93}n)$ time Monte Carlo randomized algorithm, beating the previous $\tilde{O}(\varepsilon^{-1}n)$ running time [Kopelowitz and Porat, FOCS 2015; Kopelowitz and Porat, SOSA 2018].

Our approximation algorithm exploits a connection with $3$SUM, and uses a combination of Fredman's trick, equality matrix product, and random sampling; in particular, we obtain new results on approximate counting versions of $3$SUM and Exact Triangle, which may be of independent interest. Our exact algorithms use a novel combination of hashing, bit-packed FFT, and recursion; in particular, we obtain a faster algorithm for computing the sumset of two integer sets, in the regime when the universe size is close to quadratic in the number of elements. We also prove a fine-grained equivalence between the exact Text-to-Pattern Hamming Distances problem and a range-restricted, counting version of $3$SUM.
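To make the problem statement concrete, the following is a minimal Python sketch of two quadratic-time baselines; it is not the paper's algorithm. The function names `hamming_naive` and `hamming_by_difference_counting` are hypothetical, chosen only for this illustration. The second function counts, for each shift $i$, the pairs of equal symbols at a text position $t$ and a pattern position $j$ with $t - j = i$, which is one informal way to see why the problem resembles a counting version of $3$SUM.

```python
from collections import defaultdict


def hamming_naive(text, pattern):
    """Hamming distance between pattern and text[i : i + m] for every shift i.

    Direct O(n * m) baseline taken straight from the problem definition.
    """
    n, m = len(text), len(pattern)
    return [
        sum(text[i + j] != pattern[j] for j in range(m))
        for i in range(n - m + 1)
    ]


def hamming_by_difference_counting(text, pattern):
    """Same output, computed by counting matching pairs per shift.

    For each symbol c, every pair (t, j) with text[t] == pattern[j] == c
    contributes one match to shift i = t - j. This is an illustrative
    sketch (worst case still O(n * m)), not the method from the paper.
    """
    n, m = len(text), len(pattern)
    if m > n:
        return []

    # Group pattern positions by symbol.
    pattern_positions = defaultdict(list)
    for j, c in enumerate(pattern):
        pattern_positions[c].append(j)

    # matches[i] = number of positions where pattern agrees with text at shift i.
    matches = [0] * (n - m + 1)
    for t, c in enumerate(text):
        for j in pattern_positions.get(c, ()):
            i = t - j
            if 0 <= i <= n - m:
                matches[i] += 1

    # Hamming distance = pattern length minus number of matches.
    return [m - x for x in matches]


if __name__ == "__main__":
    print(hamming_naive("abracadabra", "abr"))
    print(hamming_by_difference_counting("abracadabra", "abr"))
```

Both functions return one distance per shift $i \in \{0, \dots, n-m\}$. Faster approaches, including the classic $O(n\sqrt{m \log m})$ bound cited above, replace the inner pair counting with convolutions; that step is omitted here.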
