Constant-time edge label and leaf pointer maintenance on sliding suffix trees

Laurentius Leonard; Shunsuke Inenaga; Hideo Bannai; Takuya Mieno

TL;DR

This work tackles maintaining sliding suffix trees as a text window moves, where edge labels are stored as index-pairs that risk becoming outdated. It shows that valid edge index-pairs can be derived in constant time from leaf pointers, reducing edge-label maintenance to leaf-pointer maintenance, and introduces a new credit-free method for maintaining leaf pointers with a simple correctness proof. The proposed approach achieves $O(|T|)$ total time for leaf-pointer updates and $O(|W|)$ space, while preserving the overall $O(|T|\log\sigma)$ total time for the sliding suffix tree, matching prior performance but with improved worst-case guarantees. The combination of constant-time leaf-pointer updates and leaf-pointer-driven edge-label retrieval offers a simpler, more robust alternative to credit-based schemes and facilitates efficient online pattern matching within the sliding window; the work also raises questions about extending to multiple texts and further reducing iteration-time costs.

Abstract

Sliding suffix trees (Fiala & Greene, 1989) for an input text $T$ over an alphabet of size $\sigma$ and a sliding window $W$ of $T$ can be maintained in $O(|T| \log \sigma)$ time and $O(|W|)$ space. The two previous approaches that achieve this can be categorized into the credit-based approach of Fiala and Greene (1989) and Larsson (1996, 1999), and the batch-based approach proposed by Senft (2005). Brodnik and Jekovec (2018) showed that the sliding suffix tree can be supplemented with leaf pointers in order to find all occurrences of an online query pattern in the current window, and that leaf pointers can be maintained by credit-based arguments as well. The main difficulty in the credit-based approach is in the maintenance of index-pairs that represent each edge. In this paper, we show that valid edge index-pairs can be derived in constant time from leaf pointers, thus reducing the maintenance of edge index-pairs to the maintenance of leaf pointers. We further propose a new simple method that maintains leaf pointers without using credit-based arguments. The lack of credit-based arguments allows a simpler proof of correctness compared to the credit-based approach, whose analyses were initially flawed (Senft 2005). In addition, our method reduces the worst-case time of leaf pointer and edge label maintenance per leaf insertion and deletion from $\Theta(|W|)$ time to $O(1)$ time.
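
As a rough illustration of the reduction described above, the sketch below derives an index-pair for an edge from a leaf pointer. It is a minimal sketch under assumptions that are not taken from the paper: each node stores its string depth, each leaf stores the start index in $T$ of its suffix, each internal node stores a pointer to some leaf in its subtree, and the names `Node`, `leaf_ptr`, and `edge_index_pair` are hypothetical. Since a leaf currently in the tree corresponds to a suffix that starts inside the window, indices computed from its start position refer to text that is still available.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Node:
    depth: int                          # string depth: length of the string spelled from the root
    parent: Optional["Node"] = None
    start: Optional[int] = None         # leaves only: 1-based start index in T of the leaf's suffix
    leaf_ptr: Optional["Node"] = None   # internal nodes only: some leaf in this node's subtree (assumed)


def edge_index_pair(node: Node) -> Tuple[int, int]:
    """Return a 1-based inclusive index-pair for the edge from node.parent to node."""
    leaf = node if node.start is not None else node.leaf_ptr
    i = leaf.start
    # The path from the root to `node` spells T[i .. i + node.depth - 1];
    # the edge label is the part of that string below the parent's depth.
    return (i + node.parent.depth, i + node.depth - 1)


# Hypothetical fragment of the tree for the window "abaca" of T = "abacabaca":
# the internal node for "a" has a leaf pointer to the leaf of suffix "aca" (start 3).
T = "abacabaca"
root = Node(depth=0)
a_node = Node(depth=1, parent=root)
leaf3 = Node(depth=3, parent=a_node, start=3)
a_node.leaf_ptr = leaf3

pair = edge_index_pair(a_node)
label = T[pair[0] - 1 : pair[1]]   # convert the 1-based inclusive pair to a Python slice
```

The lookup follows one pointer and performs two additions, so it takes constant time regardless of the window size, provided the leaf pointer itself is kept valid.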

Paper Structure (29 sections, 11 theorems, 27 figures, 8 algorithms)

Introduction
Background
Paper organization
Preliminaries
Strings
Suffix trees
Sliding suffix trees
Leaf pointers
Online pattern matching with leaf pointers
Maintaining the tree structure
Insertion using Ukkonen's algorithm
Deletion
Proposed method
Getting fresh index-pairs from leaf pointers
Maintaining leaf pointers in linear total time
...and 14 more sections

Key Result

Lemma 1

The leaves located at or below $P$ correspond exactly to the occurrences of $P$ whose start-indices lie in the interval $[1 .. |W|-|\mathit{lrs}|]$.
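
To make the statement concrete, the following brute-force sketch works directly on the window string rather than on a suffix tree. It assumes that $\mathit{lrs}$ denotes the longest suffix of $W$ that occurs at least twice in $W$, and all function names are hypothetical. It lists the occurrences of $P$ and keeps those whose start-index lies in $[1 .. |W|-|\mathit{lrs}|]$, which by the lemma are the ones represented by leaves.

```python
def longest_repeating_suffix(w: str) -> str:
    """Longest suffix of w that also occurs earlier in w (assumed meaning of lrs)."""
    n = len(w)
    for k in range(n - 1, 0, -1):          # candidate lengths, longest first
        suffix = w[n - k:]
        if w.find(suffix) < n - k:         # first occurrence starts before the suffix itself
            return suffix
    return ""


def occurrences(w: str, p: str) -> list:
    """1-based start-indices of all occurrences of p in w."""
    return [i + 1 for i in range(len(w) - len(p) + 1) if w.startswith(p, i)]


def leaf_occurrences(w: str, p: str) -> list:
    """Occurrences of p with start-index in [1 .. |w| - |lrs|]."""
    limit = len(w) - len(longest_repeating_suffix(w))
    return [i for i in occurrences(w, p) if i <= limit]


W = "abaca"
P = "a"
all_occ = occurrences(W, P)
leaf_occ = leaf_occurrences(W, P)
```

For $W=\mathtt{abaca}$ and $P=\mathtt{a}$, the longest repeating suffix is $\mathtt{a}$, so the occurrences starting at 1 and 3 fall inside the interval, while the occurrence starting at 5 does not and has to be recovered by other means.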

Figures (27)

Figure 1: Sliding suffix tree across two iterations with $T=\mathtt{abacabaca}$ and $|W|=5$. $\mathit{leaf}(1)$ is deleted in the second iteration because its corresponding suffix $\mathtt{abaca}$, which would grow into $\mathtt{abacab}$, does not exist in the new window. The edge labeled $\mathtt{a}$ is an example of an edge whose index-pair becomes outdated: if it was represented by $\langle 1,1 \rangle$ in the first iteration, an update is required, as the index $1$ is no longer inside the window.
Figure 2: Two possible configurations of leaf pointers for the suffix tree of $\mathtt{abaca}$. The dashed arrows depict leaf pointers.
Figure 3: Case 3-1 for online matching with leaf pointers. The circles denote start-indices of occurrences of $P$. All occurrences of $P$ with start-indices within $[p_2..q_2]$ correspond to leaves found by traversal. From these occurrences, we can derive the occurrences with start-indices within $[p_1..q_1]$, as $W[p_1..q_1]=W[p_2..q_2]=\mathit{lrs}$.
Figure 4: Case 3-2 for online matching with leaf pointers. The circles denote start-indices of occurrences of $P$. The two rightmost circles, shown with dotted outlines, show where derived occurrences of $P$ would be if they were not out of bounds. All occurrences of $P$ with start-indices within $[p_2..p_1]$, i.e., within the leftmost $y$, correspond to leaves found by traversal. From these occurrences, we can derive the occurrences with start-indices within $[p_1..q_1]$, as $W[p_1..q_1]$ consists simply of further repetitions of $y$.
Figure 5: The subtrees starting with $\mathtt{a}$ in the suffix trees of $W=\mathtt{axazaz}$ and $W'=\mathtt{xazaz}$. The diamond shape represents the active point. While the active point $\mathit{lrs}=\mathit{lrs}'=\mathtt{az}$ remains unchanged, its locus representation may need to be updated, as it was on the edge $w \rightsquigarrow y$ before deletion and is on the edge $x \rightsquigarrow y$ after.
...and 22 more figures
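
The outdated index-pair described in the caption of Figure 1 can be checked with a few lines of code. This is a minimal sketch with hypothetical names, assuming 1-based inclusive indices into $T$ and that the window in the second iteration is $T[2..6]$.

```python
def is_fresh(pair, window):
    """True if both indices of the 1-based inclusive pair lie inside the window."""
    (begin, end), (lo, hi) = pair, window
    return lo <= begin and end <= hi


T = "abacabaca"
first_window = (1, 5)    # T[1..5] = "abaca"
second_window = (2, 6)   # T[2..6] = "bacab"

# Both pairs spell the label "a", but only one survives the slide.
old_pair_ok = is_fresh((1, 1), second_window)
new_pair_ok = is_fresh((3, 3), second_window)
```

Here $\langle 1,1 \rangle$ is valid for the first window but not the second, whereas $\langle 3,3 \rangle$ spells the same label and remains inside the window.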

Theorems & Definitions (19)

Lemma 1
proof
Lemma 2
proof
Lemma 3
proof
Lemma 4
proof
Lemma 5
Lemma 6
...and 9 more
