Space-Efficient k-Mismatch Text Indexes
Tomasz Kociumaka, Jakub Radoszewski
TL;DR
This work advances approximate text indexing by focusing on the k-mismatch problem and improving the space-time trade-offs of the foundational k-errata tree framework. It introduces a two-pronged strategy that separately handles short and long patterns, leveraging compact tries, modified suffixes, synchronizing sets, Lyndon-root runs, and two-dimensional range reporting to assemble a general index with space $O(n log^{k-1} n)$ and query time $O(log^k n log log n + m + occ)$ for arbitrary alphabets, plus tighter bounds for constant alphabets and small $k$. Central technical contributions include two refinements for short patterns (base-interval decomposition and sequence-based rank/select compression) and a comprehensive long-pattern approach based on anchors and synchronizing techniques, complemented by wildcard-aware variants and 1-modified suffix storage. The results unify and extend decades of prior work, offering practical improvements for large alphabets and short patterns, and lay groundwork for further space reductions and faster queries in k-mismatch indexing. The methods rely on robust data-structuring tools (TreeLCP, 2D range reporting, and heavy-path decompositions) and illustrate how carefully coordinated local edits and global anchors can yield substantial space savings without sacrificing query efficiency.
Abstract
A central task in string processing is text indexing, where the goal is to preprocess a text (a string of length $n$) into an efficient index (a data structure) supporting queries about the text. Cole, Gottlieb, and Lewenstein (STOC 2004) proposed $k$-errata trees, a family of text indexes supporting approximate pattern matching queries of several types. In particular, $k$-errata trees yield an elegant solution to $k$-mismatch queries, where we are to report all substrings of the text with Hamming distance at most $k$ to the query pattern. The resulting $k$-mismatch index uses $O(n\log^k n)$ space and answers a query for a length-$m$ pattern in $O(\log^k n \log \log n + m + occ)$ time, where $occ$ is the number of approximate occurrences. In retrospect, $k$-errata trees appear very well optimized: even though a large body of work has adapted $k$-errata trees to various settings throughout the past two decades, the original time-space trade-off for $k$-mismatch indexing has not been improved in the general case. We present the first such improvement, a $k$-mismatch index with $O(n\log^{k-1} n)$ space and the same query time as $k$-errata trees. Previously, due to a result of Chan, Lam, Sung, Tam, and Wong (Algorithmica 2010), such an $O(n\log^{k-1} n)$-size index has been known only for texts over alphabets of constant size. In this setting, however, we obtain an even smaller $k$-mismatch index of size only $O(n \log^{k-2+\varepsilon+\frac{2}{k+2-(k \bmod 2)}} n)\subseteq O(n\log^{k-1.5+\varepsilon} n)$ for $2\le k\le O(1)$ and any constant $\varepsilon>0$. Along the way, we also develop improved indexes for short patterns, offering better trade-offs in this practically relevant special case.
