Space-Efficient Text Indexing with Mismatches using Function Inversion

Jackson Bibbens, Levi Borevitz, Samuel McCauley

Abstract

A classic data structure problem is to preprocess a string $T$ of length $n$ so that, given a query $q$, we can quickly find all substrings of $T$ with Hamming distance at most $k$ from the query string. Variants of this problem have seen significant research both in theory and in practice. For a wide parameter range, the best worst-case bounds are achieved by the "CGL tree" (Cole, Gottlieb, Lewenstein 2004), which achieves query time roughly $\tilde{O}(|q| + \log^k n + \#occ)$, where $\#occ$ is the size of the output, and space $O(n \log^k n)$. The space of the CGL tree was recently improved to $O(n \log^{k-1} n)$ (Kociumaka, Radoszewski 2026). A natural question is whether a high space bound is necessary. How efficient can we make queries when the data structure is constrained to $O(n)$ space? While this question has seen extensive research, all known results have query time with unfavorable dependence on $n$, $k$, and the alphabet $\Sigma$. The state-of-the-art query time (Chan et al. 2011) is roughly $\tilde{O}(|q| + |\Sigma|^k \log^{k^2 + k} n + \#occ)$. We give an $O(n)$-space data structure with query time roughly $\tilde{O}(|q| + \log^{4k} n + \log^{2k} n \cdot \#occ)$, with no dependence on $|\Sigma|$. Even if $|\Sigma| = O(1)$, this is the best known query time for linear space if $k \geq 3$, unless $\#occ$ is large. Our results give a smooth tradeoff between time and space. We also give the first sublinear-space results: we give a succinct data structure using only $o(n)$ space in addition to the text itself. Our main technical idea is to apply function inversion (Fiat, Naor 2000) to the CGL tree. Combining these techniques is not immediate; in fact, we revisit the exposition of both the Fiat-Naor data structure and the CGL tree to obtain our bounds. Along the way, we obtain improved performance for both data structures, which may be of independent interest.
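To make the query semantics concrete, the following is a minimal brute-force sketch of the problem statement only. It is not the paper's data structure and uses none of its techniques: it performs no preprocessing and scans every window of the text, taking time proportional to $n \cdot |q|$ per query. The function names `hamming_at_most` and `naive_mismatch_search` are hypothetical and chosen purely for illustration.

```python
def hamming_at_most(a: str, b: str, k: int) -> bool:
    """Return True if equal-length strings a and b differ in at most k positions."""
    mismatches = 0
    for x, y in zip(a, b):
        if x != y:
            mismatches += 1
            if mismatches > k:
                return False  # stop early once the mismatch budget is exceeded
    return True


def naive_mismatch_search(text: str, query: str, k: int) -> list[int]:
    """Return every start index i such that text[i:i+len(query)] is within
    Hamming distance k of query (a baseline with no index at all)."""
    m = len(query)
    # If the query is longer than the text, the range is empty.
    return [
        i
        for i in range(len(text) - m + 1)
        if hamming_at_most(text[i:i + m], query, k)
    ]


if __name__ == "__main__":
    # Exact matching (k = 0) versus allowing one mismatch (k = 1).
    print(naive_mismatch_search("banana", "bana", 0))
    print(naive_mismatch_search("banana", "bana", 1))
```

An index for this problem must return the same set of positions as this baseline while avoiding the full scan; the bounds quoted in the abstract measure how much space is spent to do so.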

Paper Structure

This paper contains 88 sections, 28 theorems, 23 equations, 5 tables.

Key Result

Theorem 1

For any $\sigma$ satisfying $1 \leq \sigma \leq \binom{\log n}{k}$, there exists a Text Indexing with Mismatches data structure with $O(n\binom{\log n}{k}/\sigma)$ space that can be constructed in $O(nk^2\binom{\log n}{k} (\log n + k^2(\log\log n)^2))$ expected time and can answer que...

Theorems & Definitions (51)

  • Theorem 1
  • Corollary 2
  • Theorem 3
  • Theorem 4
  • Theorem 5
  • Lemma 6
  • Definition 7: Recursive Subsets
  • Lemma 8
  • proof
  • Definition 9: Traversal Destination; Manual Traversal
  • ...and 41 more