String Indexing for Top-$k$ Close Consecutive Occurrences

Philip Bille, Inge Li Gørtz, Max Rishøj Pedersen, Eva Rotenberg, Teresa Anna Steiner

TL;DR

This work introduces string indexing for top-$k$ close consecutive occurrences (SITCCO), a natural extension of pattern indexing that reports the $k$ closest consecutive occurrences of a pattern in a text. It develops three main data structures: (i) a simple $O(n\log n)$ space solution with $O(m+k)$ query time via heavy-path decomposition and a line-segment representation, (ii) a linear-space solution for fixed $k$ using a cluster decomposition of the suffix tree, and (iii) an extension of that scheme to general $k$ with $O(n\log\log n)$ space and $O(m+k^2)$ time, plus a linear-space version with $O(m+k^{1+\varepsilon})$ time. A further trade-off using orthogonal range successor yields $O(m+k\log^{1+\varepsilon} n)$ time at $O(n/\varepsilon)$ space. The authors also extend the framework to related problems such as top-$k$ far consecutive occurrences and consecutive occurrences with distance constraints, introducing new techniques such as translating the problem to line-segment intersections on heavy paths and recursive tree clustering, with rank-space compression and boundary-node data for space efficiency. These results provide near-optimal indexing methods for proximity-constrained pattern matching and open avenues for efficient non-overlapping and interval-based variants in string databases.

Abstract

The classic string indexing problem is to preprocess a string $S$ into a compact data structure that supports efficient subsequent pattern matching queries, that is, given a pattern string $P$, report all occurrences of $P$ within $S$. In this paper, we study a basic and natural extension of string indexing called the string indexing for top-$k$ close consecutive occurrences problem (SITCCO). Here, a consecutive occurrence is a pair $(i,j)$, $i < j$, such that $P$ occurs at positions $i$ and $j$ in $S$ and there is no occurrence of $P$ between $i$ and $j$, and their distance is defined as $j-i$. Given a pattern $P$ and a parameter $k$, the goal is to report the top-$k$ consecutive occurrences of $P$ in $S$ of minimal distance. The challenge is to compactly represent $S$ while supporting queries in time close to the length of $P$ and $k$. We give three time-space trade-offs for the problem. Let $n$ be the length of $S$, $m$ the length of $P$, and $\varepsilon\in(0,1]$. Our first result achieves $O(n\log n)$ space and optimal query time of $O(m+k)$. Our second and third results achieve linear space and query times either $O(m+k^{1+\varepsilon})$ or $O(m + k \log^{1+\varepsilon} n)$. Along the way, we develop several techniques of independent interest, including a new translation of the problem into a line segment intersection problem and a new recursive clustering technique for trees.
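To make the problem definition concrete, the following is a minimal brute-force sketch in Python. It is a baseline for checking small examples, not any of the paper's indexes: it scans the text on every query, so it takes time proportional to the text length rather than $O(m+k)$. The function names `occurrences` and `top_k_close` are hypothetical, positions are 1-indexed, and ties are broken by the smaller starting position (the problem statement allows any tie-breaking).

```python
import heapq


def occurrences(text, pattern):
    """Return the sorted 1-indexed start positions of pattern in text.

    Overlapping occurrences are included.
    """
    positions = []
    start = text.find(pattern)
    while start != -1:
        positions.append(start + 1)  # convert to 1-indexed
        start = text.find(pattern, start + 1)
    return positions


def top_k_close(positions, k):
    """Return the k consecutive occurrences (i, j) of smallest distance j - i.

    `positions` must be sorted. Adjacent entries form exactly the
    consecutive occurrences, because no occurrence lies between them.
    Ties are broken by the smaller i (an assumption of this sketch).
    """
    pairs = zip(positions, positions[1:])
    return heapq.nsmallest(k, pairs, key=lambda p: (p[1] - p[0], p[0]))


# Occurrence positions taken from the paper's Figure 1.
figure_1_positions = [4, 7, 11, 22, 24, 26, 30, 39, 41]
print(top_k_close(figure_1_positions, 5))
```

With the Figure 1 positions and $k=5$, the sketch reports the three pairs of distance 2, then $(4,7)$, and then one of the two pairs of distance 4.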

Paper Structure

This paper contains 29 sections, 14 theorems, 4 equations, 5 figures.

Key Result

Theorem 1

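Given a string $S$ of length $n$ and $\epsilon$, $0<\epsilon\le 1$, we can build a data structure that can answer top-$k$ close consecutive occurrences queries using one of three trade-offs, stated in the abstract as: $O(n\log n)$ space and $O(m+k)$ query time, or linear space and query time either $O(m+k^{1+\epsilon})$ or $O(m+k\log^{1+\epsilon} n)$. Here, $m$ is the length of the query pattern.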

Figures (5)

  • Figure 1: $P$ occurs at positions 4, 7, 11, 22, 24, 26, 30, 39 and 41 in $S$. The top $5$ close consecutive occurrences are $(22,24)$, $(24,26)$, $(39,41)$, $(4,7)$, and $(7,11)$, with the tie between $(7,11)$ and $(26,30)$ broken arbitrarily.
  • Figure 2: Line segments for a heavy path from the suffix tree for "BATMAN-AND-ANNA-SING-NANANANA-AND-EAT-BANANAS". Where line segments overlap, a number denotes how many consecutive occurrences the segment corresponds to. At depth 1, we have line segments corresponding to pairs of consecutive occurrences of string A: there are six pairs that have a distance of 2, three pairs that have a distance of 3, two pairs that have a distance of 4, and so on. At depth 2, we encode the consecutive occurrences of string AN. Some of them are the same as for string A.
  • Figure 3: The suffix tree is divided into clusters (grey loops) of size $\le k$ which are either leaf clusters, or path clusters with spines marked in red. For every spine we store a line segment data structure, also marked in red.
  • Figure 4: Here, we see the recursive clustering: The black clustering is the coarsest clustering and the green and blue are finer sub-clusterings.
  • Figure 5: Illustration of a pair defining more than one line segment. To the left are the positions of the occurrences in $S$, in the middle is the spine of a cluster, and to the right are the corresponding line segments. The pair $(i,j)$ is amongst the top $k$ farthest until the occurrence $x$ disappears, after which it is pushed out by the pair $(a,b)$. When $b$ then disappears, $(i,j)$ is again amongst the $k$ farthest.
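The distance counts quoted in the Figure 2 caption can be checked directly. The sketch below tallies the distances between consecutive occurrences of a pattern in the example string. It is a plain scan for illustration only and does not build the suffix tree, heavy paths, or line segments. The helper name `distance_histogram` is hypothetical.

```python
from collections import Counter

# Example string from the Figure 2 caption.
TEXT = "BATMAN-AND-ANNA-SING-NANANANA-AND-EAT-BANANAS"


def distance_histogram(text, pattern):
    """Count how many consecutive occurrences of pattern have each distance."""
    # All start positions of pattern, in increasing order.
    positions = [i for i in range(len(text)) if text.startswith(pattern, i)]
    # Adjacent positions form the consecutive occurrences.
    return Counter(j - i for i, j in zip(positions, positions[1:]))


hist_a = distance_histogram(TEXT, "A")
for distance in sorted(hist_a):
    print(distance, hist_a[distance])
```

For pattern A this yields six pairs at distance 2, three at distance 3, and two at distance 4, in agreement with the caption.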

Theorems & Definitions (20)

  • Theorem 1
  • Lemma 2: Sleator and Tarjan
  • Lemma 3
  • proof
  • Lemma 4
  • Lemma 5
  • proof
  • Lemma 6
  • Lemma 7
  • Claim 8
  • ...and 10 more