String Indexing for Top-$k$ Close Consecutive Occurrences
Philip Bille, Inge Li Gørtz, Max Rishøj Pedersen, Eva Rotenberg, Teresa Anna Steiner
TL;DR
This work introduces string indexing for top-$k$ close consecutive occurrences (SITCCO), a natural extension of string indexing in which a query reports the $k$ consecutive occurrences of a pattern in a text that have the smallest distance. It develops three main data structures: (i) a simple $O(n\log n)$ space solution with $O(m+k)$ query time, based on heavy-path decomposition and a line-segment representation; (ii) a linear-space solution for fixed $k$, based on a cluster decomposition of the suffix tree; and (iii) an extension to general $k$ using $O(n\log\log n)$ space and $O(m+k^2)$ query time, together with a linear-space version with the same $O(m+k^2)$ query time. A further trade-off using orthogonal range successor yields $O(m+k\log^{1+\varepsilon} n)$ query time in $O(n/\varepsilon)$ space. The authors also extend the framework to related problems, such as top-$k$ far consecutive occurrences and consecutive occurrences with distance constraints. The new techniques include a translation of the problem into line-segment intersection on heavy paths and a recursive tree clustering, with rank-space compression and data stored at boundary nodes to keep the space low. These results provide near-optimal indexing methods for proximity-constrained pattern matching and open avenues for efficient non-overlapping and interval-based variants in string databases.
Abstract
The classic string indexing problem is to preprocess a string $S$ into a compact data structure that supports efficient subsequent pattern matching queries, that is, given a pattern string $P$, report all occurrences of $P$ within $S$. In this paper, we study a basic and natural extension of string indexing called the string indexing for top-$k$ close consecutive occurrences problem (SITCCO). Here, a consecutive occurrence is a pair $(i,j)$, $i < j$, such that $P$ occurs at positions $i$ and $j$ in $S$ and there is no occurrence of $P$ between $i$ and $j$, and their distance is defined as $j-i$. Given a pattern $P$ and a parameter $k$, the goal is to report the top-$k$ consecutive occurrences of $P$ in $S$ of minimal distance. The challenge is to compactly represent $S$ while supporting queries in time close to the length of $P$ and $k$. We give three time-space trade-offs for the problem. Let $n$ be the length of $S$, $m$ the length of $P$, and $\varepsilon\in(0,1]$. Our first result achieves $O(n\log n)$ space and optimal query time of $O(m+k)$. Our second and third results achieve linear space and query times either $O(m+k^{1+\varepsilon})$ or $O(m + k \log^{1+\varepsilon} n)$. Along the way, we develop several techniques of independent interest, including a new translation of the problem into a line segment intersection problem and a new recursive clustering technique for trees.
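
To make the problem definition concrete, the following is a minimal brute-force sketch of what a SITCCO query is expected to return. It is not one of the data structures from the paper: it scans $S$ at query time instead of using a preprocessed index, so it only illustrates the definitions of consecutive occurrence and distance. The function names, the 0-indexed positions, and the tie-breaking rule (equal distances are ordered by starting position) are assumptions made for this illustration.

```python
from typing import List, Tuple


def occurrences(s: str, p: str) -> List[int]:
    """Return all start positions of p in s (overlaps allowed), in increasing order."""
    positions = []
    start = s.find(p)
    while start != -1:
        positions.append(start)
        # Resume one position later so overlapping occurrences are also found.
        start = s.find(p, start + 1)
    return positions


def top_k_close_consecutive(s: str, p: str, k: int) -> List[Tuple[int, int]]:
    """Return the k consecutive occurrences (i, j) of p in s with smallest j - i.

    Assumption for this sketch: positions are 0-indexed and ties in distance
    are broken by the smaller starting position i.
    """
    occ = occurrences(s, p)
    # Neighbouring positions in sorted order are exactly the consecutive
    # occurrences: no other occurrence of p lies between them.
    pairs = list(zip(occ, occ[1:]))
    pairs.sort(key=lambda ij: (ij[1] - ij[0], ij[0]))
    return pairs[:k]


# "a" occurs in "abracadabra" at positions 0, 3, 5, 7, 10, so the consecutive
# occurrences are (0,3), (3,5), (5,7), (7,10) with distances 3, 2, 2, 3.
print(top_k_close_consecutive("abracadabra", "a", 2))  # [(3, 5), (5, 7)]
```

This baseline spends time proportional to the length of $S$ on every query, which is what the indexing results above avoid: their query times depend on $m$ and $k$, with at most polylogarithmic dependence on $n$.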
