Longest Common Extensions with Wildcards: Trade-off and Applications
Gabriel Bathie, Itai Boneh, Panagiotis Charalampopoulos, Jonas Ellert, Tatiana Starikovskaya
TL;DR
This work studies the Longest Common Extension problem with wildcards (LCEW), parameterized by $G$, the number of maximal contiguous groups of wildcards in the input string. It introduces a simple deterministic data structure with a flexible time-space trade-off: for any $t \in [1, G]$, it uses $O(nG/t)$ space, is built in $O(n (G/t) \log n)$ time, and answers queries in $O(t)$ time, thereby interpolating smoothly between existing approaches. The authors connect LCEW to Boolean matrix multiplication to show that the trade-off is optimal, up to subpolynomial factors, among combinatorial data structures when $G = \Omega(n^{\varepsilon})$ under a widely believed hypothesis, and they derive a deterministic combinatorial algorithm for sparse Boolean matrix multiplication. They also establish conditional lower bounds for non-combinatorial data structures from the 3SUM and Set-Disjointness conjectures, and show that the LCEW data structure yields efficient algorithms for approximate pattern matching with wildcards and for computing periodicity-related arrays. Overall, the paper links LCEW data structures, lower bounds, and algorithmic applications in the processing of strings with wildcards.
Abstract
We study the Longest Common Extension (LCE) problem in a string containing wildcards. Wildcards (also called "don't cares" or "holes") are special characters that match any other character in the alphabet, similar to the character "?" in Unix commands or "." in regular expression engines. We consider the problem parametrized by $G$, the number of maximal contiguous groups of wildcards in the input string. Our main contribution is a simple data structure for this problem that can be built in $O(n (G/t) \log n)$ time, occupies $O(nG/t)$ space, and answers queries in $O(t)$ time, for any $t \in [1, G]$. Up to the $O(\log n)$ factor, this interpolates smoothly between the data structure of Crochemore et al. [JDA 2015], which has $O(nG)$ preprocessing time and space, and $O(1)$ query time, and a simple solution based on the "kangaroo jumping" technique [Landau and Vishkin, STOC 1986], which has $O(n)$ preprocessing time and space, and $O(G)$ query time. By establishing a connection between this problem and Boolean matrix multiplication, we show that our solution is optimal, up to subpolynomial factors, among combinatorial data structures when $G = \Omega(n^{\varepsilon})$ under a widely believed hypothesis. In addition, we develop a simple deterministic combinatorial algorithm for sparse Boolean matrix multiplication. We further establish a conditional lower bound for non-combinatorial data structures, stating that $O(nG/t^4)$ preprocessing time (resp. space) is optimal, up to subpolynomial factors, for any data structure with query time $t$ for a wide range of $t$ and $G$, assuming the well-established $\textsf{3SUM}$ (resp. $\textsf{Set-Disjointness}$) conjecture. Finally, we show that our data structure can be used to obtain efficient algorithms for approximate pattern matching and structural analysis of strings with wildcards.
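To make the problem statement concrete, the following minimal Python sketch illustrates what an LCEW query returns and how the parameter $G$ is counted. It is a naive linear-scan illustration of the problem definition only, not the data structure described in this paper. The wildcard symbol `?` and the function names `count_wildcard_groups` and `lcew_naive` are assumptions introduced here for illustration.

```python
WILDCARD = "?"  # assumed wildcard symbol for this sketch


def count_wildcard_groups(s: str) -> int:
    """Return G, the number of maximal contiguous runs of wildcards in s."""
    groups = 0
    prev_is_wild = False
    for ch in s:
        is_wild = ch == WILDCARD
        # A new group starts at a wildcard that does not follow another wildcard.
        if is_wild and not prev_is_wild:
            groups += 1
        prev_is_wild = is_wild
    return groups


def lcew_naive(s: str, i: int, j: int) -> int:
    """Return the largest length such that s[i:i+length] matches s[j:j+length],
    where a wildcard matches any character. Runs in time linear in the answer."""
    length = 0
    while i + length < len(s) and j + length < len(s):
        a, b = s[i + length], s[j + length]
        # Two positions mismatch only if both are non-wildcards and differ.
        if a != b and a != WILDCARD and b != WILDCARD:
            break
        length += 1
    return length


if __name__ == "__main__":
    text = "ab??ab?c"
    print(count_wildcard_groups(text))  # two groups: "??" and "?"
    print(lcew_naive(text, 0, 4))       # the suffixes at 0 and 4 match for 4 positions
```

In this sketch every query scans the string directly, so a query can take time linear in $n$. The data structures discussed in the abstract avoid this by preprocessing the string so that a query costs $O(t)$ time instead.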
