Hamming Distance Oracle
Itai Boneh, Dvir Fried, Shay Golan, Matan Kraus
TL;DR
This work introduces the Hamming Distance Oracle problem: given strings $S$ and $T$ of lengths $n$ and $m$, preprocess them so that the Hamming distance (HD) between any substring of $S$ and any equal-length substring of $T$ can be reported efficiently. It achieves a tunable trade-off between preprocessing time and query time using a suffix-based dynamic programming approach whose cost depends on the time needed to solve text-to-pattern HD queries. For every $x \le nm$, this yields $\tilde{O}(nm/x)$ preprocessing time and $O(x)$ query time for constant-size alphabets, and $\tilde{O}(nm/\sqrt{x})$ preprocessing time with the same query time for general alphabets. A central contribution is a conditional lower bound under the combinatorial matrix-multiplication conjecture, obtained via a reduction from Boolean matrix multiplication to the Hamming Distance Oracle, which rules out polynomially faster combinatorial trade-offs for the binary case. Together, the upper and lower bounds show that the trade-off for constant-size alphabets is near-optimal among combinatorial algorithms, and they connect the problem to fundamental questions in combinatorial matrix multiplication.
Abstract
In this paper, we present and study the \emph{Hamming distance oracle problem}. In this problem, the task is to preprocess two strings $S$ and $T$ of lengths $n$ and $m$, respectively, to obtain a data structure that can answer queries about the Hamming distance between a substring of $S$ and a substring of $T$. For strings over a constant-size alphabet, we show that for every $x \le nm$ there is a data structure with $\tilde{O}(nm/x)$ preprocessing time and $O(x)$ query time. We also provide a combinatorial conditional lower bound, showing that for every $\varepsilon > 0$ and $x \le nm$ there is no data structure with query time $O(x)$ and preprocessing time $O((\frac{nm}{x})^{1-\varepsilon})$ unless combinatorial fast matrix multiplication is possible. For strings over a general alphabet, we present a data structure with $\tilde{O}(nm/\sqrt{x})$ preprocessing time and $O(x)$ query time for every $x \le nm$.
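
To make the query model concrete, the following is a minimal baseline sketch and not the construction from the paper. It stores mismatch prefix counts along each diagonal of the $n \times m$ alignment grid, keeping only every $x$-th count, so a query scans fewer than $2x$ characters. Its preprocessing still takes $\Theta(nm)$ time for every $x$; only the stored table shrinks to $O(nm/x)$ entries, which is why the $\tilde{O}(nm/x)$ preprocessing bound above is not trivial. The names `naive_hamming` and `SampledDiagonalOracle` are hypothetical and chosen for illustration.

```python
from typing import Dict, List


def naive_hamming(a: str, b: str) -> int:
    """Hamming distance of two equal-length strings by direct comparison."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(ca != cb for ca, cb in zip(a, b))


class SampledDiagonalOracle:
    """Baseline oracle: sampled mismatch prefix counts on every diagonal.

    Assumption: this illustrates the query interface only. Preprocessing
    is Theta(n*m) time regardless of x, unlike the paper's data structure.
    """

    def __init__(self, s: str, t: str, x: int = 1) -> None:
        if x < 1:
            raise ValueError("x must be a positive integer")
        self.s, self.t, self.x = s, t, x
        n, m = len(s), len(t)
        # samples[d][q] = number of mismatches among the first q*x aligned
        # pairs on diagonal d, where d = i - j.
        self.samples: Dict[int, List[int]] = {}
        for d in range(-(m - 1), n):
            i0, j0 = max(d, 0), max(-d, 0)
            count, kept = 0, [0]
            for k in range(min(n - i0, m - j0)):
                count += s[i0 + k] != t[j0 + k]
                if (k + 1) % x == 0:
                    kept.append(count)
            self.samples[d] = kept

    def _prefix(self, d: int, k: int) -> int:
        """Mismatches among the first k aligned pairs on diagonal d."""
        i0, j0 = max(d, 0), max(-d, 0)
        q = k // self.x
        total = self.samples[d][q]
        # Scan the fewer than x pairs that follow the last stored sample.
        for p in range(q * self.x, k):
            total += self.s[i0 + p] != self.t[j0 + p]
        return total

    def query(self, i: int, j: int, length: int) -> int:
        """Hamming distance between s[i:i+length] and t[j:j+length]."""
        if length < 0 or i < 0 or j < 0:
            raise ValueError("indices and length must be non-negative")
        if i + length > len(self.s) or j + length > len(self.t):
            raise ValueError("substring out of range")
        if length == 0:
            return 0
        d, k = i - j, min(i, j)
        return self._prefix(d, k + length) - self._prefix(d, k)


if __name__ == "__main__":
    oracle = SampledDiagonalOracle("banana", "bandana", x=2)
    # Compare the oracle answer with the direct computation.
    print(oracle.query(1, 2, 3), naive_hamming("ana", "nda"))
```

Setting `x = 1` stores the full prefix-count table and answers queries in constant time. Larger values of `x` store fewer counts and spend up to $O(x)$ time per query, which mirrors the shape of the trade-off but not the paper's preprocessing bound.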
