String Indexing for Top-$k$ Close Consecutive Occurrences

Philip Bille; Inge Li Gørtz; Max Rishøj Pedersen; Eva Rotenberg; Teresa Anna Steiner

TL;DR

This work introduces string indexing for top-$k$ close consecutive occurrences (SITCCO), a natural extension of pattern indexing that reports the $k$ closest consecutive occurrences of a pattern in a text. It develops three main data structures: (i) a simple $O(n\log n)$ space solution with $O(m+k)$ query time via heavy-path decomposition and a line-segment representation, (ii) a linear-space solution for fixed $k$ using a cluster decomposition of the suffix tree, and (iii) an extension of that scheme to general $k$, first with $O(n\log\log n)$ space and $O(m+k^2)$ time and then with linear space and the same $O(m+k^2)$ time. A further trade-off using orthogonal range successor yields $O(m+k\log^{1+\varepsilon} n)$ time at $O(n/\varepsilon)$ space. The authors also extend the framework to related problems such as top-$k$ far consecutive occurrences and consecutive occurrences with distance constraints. The new techniques include translating the problem into line-segment intersections on heavy paths and recursive tree clustering, with rank-space compression and boundary-node data for space efficiency. These results provide near-optimal indexing methods for proximity-constrained pattern matching and open avenues for efficient non-overlapping and interval-based variants in string databases.

Abstract

The classic string indexing problem is to preprocess a string $S$ into a compact data structure that supports efficient subsequent pattern matching queries, that is, given a pattern string $P$, report all occurrences of $P$ within $S$. In this paper, we study a basic and natural extension of string indexing called the string indexing for top-$k$ close consecutive occurrences problem (SITCCO). Here, a consecutive occurrence is a pair $(i,j)$, $i < j$, such that $P$ occurs at positions $i$ and $j$ in $S$ and there is no occurrence of $P$ between $i$ and $j$, and their distance is defined as $j-i$. Given a pattern $P$ and a parameter $k$, the goal is to report the top-$k$ consecutive occurrences of $P$ in $S$ of minimal distance. The challenge is to compactly represent $S$ while supporting queries in time close to the length of $P$ and $k$. We give three time-space trade-offs for the problem. Let $n$ be the length of $S$, $m$ the length of $P$, and $\varepsilon\in(0,1]$. Our first result achieves $O(n\log n)$ space and optimal query time of $O(m+k)$. Our second and third results achieve linear space and query times either $O(m+k^{1+\varepsilon})$ or $O(m + k \log^{1+\varepsilon} n)$. Along the way, we develop several techniques of independent interest, including a new translation of the problem into a line segment intersection problem and a new recursive clustering technique for trees.
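To make the problem definition concrete, the following is a minimal brute-force sketch in Python. It is not one of the paper's data structures: it rescans $S$ on every query, so it takes time linear in $n$ per query rather than time close to $m+k$. It only illustrates what a query is supposed to return. The function names `occurrences` and `top_k_close_consecutive` are hypothetical. Positions are 0-indexed here, ties between equal distances are broken by position, and the pattern is assumed to be non-empty; all three are assumptions of this sketch, not statements from the abstract.

```python
import heapq


def occurrences(s: str, p: str) -> list[int]:
    """Return all start positions of p in s (overlaps allowed), in increasing order."""
    positions = []
    start = s.find(p)
    while start != -1:
        positions.append(start)
        # Advance by one character so overlapping occurrences are also found.
        start = s.find(p, start + 1)
    return positions


def top_k_close_consecutive(s: str, p: str, k: int) -> list[tuple[int, int]]:
    """Naive reference: the k consecutive occurrences (i, j) of p with smallest j - i."""
    occ = occurrences(s, p)
    # Neighbouring positions in the sorted list are exactly the consecutive
    # occurrences: no other occurrence of p lies strictly between them.
    pairs = zip(occ, occ[1:])
    # nsmallest behaves like sorted(...)[:k], so equal distances keep text order.
    return heapq.nsmallest(k, pairs, key=lambda pair: pair[1] - pair[0])


if __name__ == "__main__":
    # "a" occurs in "abracadabra" at positions 0, 3, 5, 7, 10.
    print(top_k_close_consecutive("abracadabra", "a", 2))
```

With $S$ = `abracadabra` and $P$ = `a`, the consecutive occurrences are $(0,3)$, $(3,5)$, $(5,7)$, $(7,10)$ with distances 3, 2, 2, 3, so a query with $k=2$ reports the two pairs at distance 2. If fewer than $k$ consecutive occurrences exist, the sketch returns all of them.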

