Constant-time edge label and leaf pointer maintenance on sliding suffix trees

Laurentius Leonard, Shunsuke Inenaga, Hideo Bannai, Takuya Mieno

TL;DR

This work tackles maintaining sliding suffix trees as a text window moves, where edge labels are stored as index-pairs that risk becoming outdated. It shows that valid edge index-pairs can be derived in constant time from leaf pointers, reducing edge-label maintenance to leaf-pointer maintenance, and introduces a new credit-free method for maintaining leaf pointers with a simple correctness proof. The proposed approach achieves $O(|T|)$ total time for leaf-pointer updates and $O(|W|)$ space, while preserving the overall $O(|T|\log\sigma)$ total time for the sliding suffix tree, matching prior performance but with improved worst-case guarantees. The combination of constant-time leaf-pointer updates and leaf-pointer-driven edge-label retrieval offers a simpler, more robust alternative to credit-based schemes and facilitates efficient online pattern matching within the sliding window; the work also raises questions about extending to multiple texts and further reducing iteration-time costs.

Abstract

Sliding suffix trees (Fiala & Greene, 1989) for an input text $T$ over an alphabet of size $\sigma$ and a sliding window $W$ of $T$ can be maintained in $O(|T| \log \sigma)$ time and $O(|W|)$ space. The previous approaches that achieve this fall into two categories: the credit-based approach of Fiala and Greene (1989) and Larsson (1996, 1999), and the batch-based approach proposed by Senft (2005). Brodnik and Jekovec (2018) showed that the sliding suffix tree can be supplemented with leaf pointers in order to find all occurrences of an online query pattern in the current window, and that leaf pointers can be maintained by credit-based arguments as well. The main difficulty in the credit-based approach is the maintenance of the index-pairs that represent each edge. In this paper, we show that valid edge index-pairs can be derived in constant time from leaf pointers, thus reducing the maintenance of edge index-pairs to the maintenance of leaf pointers. We further propose a new simple method that maintains leaf pointers without using credit-based arguments. The lack of credit-based arguments allows a simpler proof of correctness than in the credit-based approach, whose analyses were initially flawed (Senft 2005). In addition, our method reduces the worst-case time of leaf pointer and edge label maintenance per leaf insertion and deletion from $\Theta(|W|)$ time to $O(1)$ time.
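
The reduction from edge index-pairs to leaf pointers can be illustrated with a minimal sketch. It assumes the usual suffix-tree convention that an edge between two nodes spells the characters between their string depths along any suffix whose leaf lies below the lower node. The function names `edge_index_pair` and `label`, their parameters, and the 1-based inclusive indexing are hypothetical choices made for this illustration; the paper's actual procedure may differ in its details.

```python
def edge_index_pair(leaf_start: int, parent_depth: int, child_depth: int) -> tuple[int, int]:
    """Derive a 1-based inclusive index-pair <i, j> for an edge in O(1).

    leaf_start   -- start-index in T of the suffix represented by the leaf that
                    the lower node's leaf pointer refers to (assumed to be a
                    leaf at or below that node, hence inside the window)
    parent_depth -- string depth of the edge's upper node
    child_depth  -- string depth of the edge's lower node
    """
    if not 0 <= parent_depth < child_depth:
        raise ValueError("child_depth must exceed a non-negative parent_depth")
    return (leaf_start + parent_depth, leaf_start + child_depth - 1)


def label(text: str, pair: tuple[int, int]) -> str:
    """Read T[i..j] for a 1-based inclusive index-pair."""
    i, j = pair
    return text[i - 1:j]


# Example from Figure 1: T = abacabaca, |W| = 5, window T[2..6] after the slide.
T = "abacabaca"
window = (2, 6)

# Assumption for illustration: the node reached by "a" (string depth 1, parent
# is the root at depth 0) has a leaf pointer to the leaf of the suffix that
# starts at position 3 of T.
pair = edge_index_pair(leaf_start=3, parent_depth=0, child_depth=1)
print(pair, label(T, pair))
```

Because the leaf's suffix starts inside the current window, the derived pair `(3, 3)` lies inside the window as well, whereas the stale pair `(1, 1)` from the first iteration does not.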

Paper Structure (29 sections, 11 theorems, 27 figures, 8 algorithms)

This paper contains 29 sections, 11 theorems, 27 figures, 8 algorithms.

Key Result

Lemma 1

The leaves located at or below $P$ correspond exactly to the occurrences of $P$ whose start-index lies in the interval $[1 .. |W|-|\mathit{lrs}|]$.
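
The statement of Lemma 1 can be checked on small inputs with a brute-force sketch that does not build a suffix tree at all. It assumes that $\mathit{lrs}$ denotes the longest repeating suffix of $W$, that is, the longest suffix of $W$ that also occurs at an earlier position (a reading consistent with the Figure 5 caption). The helper names below are hypothetical and exist only for this illustration.

```python
def longest_repeating_suffix(w: str) -> str:
    """Longest proper suffix of w that also occurs at an earlier start position."""
    n = len(w)
    for length in range(n - 1, 0, -1):
        suffix = w[n - length:]
        # str.find returns the first occurrence; the suffix repeats exactly
        # when that occurrence starts before the suffix's own position.
        if w.find(suffix) < n - length:
            return suffix
    return ""


def occurrences(w: str, p: str) -> list[int]:
    """All 1-based start-indices of p in w, found by naive scanning."""
    return [i + 1 for i in range(len(w) - len(p) + 1) if w.startswith(p, i)]


def leaf_occurrences(w: str, p: str) -> list[int]:
    """Occurrences of p that Lemma 1 attributes to leaves at or below p."""
    limit = len(w) - len(longest_repeating_suffix(w))
    return [i for i in occurrences(w, p) if i <= limit]


W = "abaca"
print(longest_repeating_suffix(W))   # a
print(occurrences(W, "a"))           # [1, 3, 5]
print(leaf_occurrences(W, "a"))      # [1, 3]
```

For $W=\mathtt{abaca}$ the interval is $[1..4]$, so the occurrence of $\mathtt{a}$ at start-index 5 has no leaf of its own and would have to be recovered separately.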

Figures (27)

  • Figure 1: Sliding suffix tree across two iterations with $T=\mathtt{abacabaca}$ and $|W|=5$. $\mathit{leaf}(1)$ is deleted in the second iteration because its corresponding suffix $\mathtt{abaca}$, which would grow into $\mathtt{abacab}$, does not exist in the new window. The edge labeled $\mathtt{a}$ shows an example of an edge whose index-pair becomes outdated; if it were represented by $\langle 1,1 \rangle$ in the first iteration, an update would be required, as the index $1$ contained in the interval is no longer inside the window.
  • Figure 2: Two possible configurations of leaf pointers for the suffix tree of $\mathtt{abaca}$. The dashed arrows depict leaf pointers.
  • Figure 3: Case 3-1 for online matching with leaf pointers. The circles denote start-indices of occurrences of $P$. All occurrences of $P$ with start-indices within $[p_2..q_2]$ correspond to leaves found by traversal. From these occurrences, we can derive the occurrences with start-indices within $[p_1..q_1]$, as $W[p_1..q_1]=W[p_2..q_2]=\mathit{lrs}$.
  • Figure 4: Case 3-2 for online matching with leaf pointers. The circles denote start-indices of occurrences of $P$. The two rightmost circles, shown with dotted outlines, show where derived occurrences of $P$ would be if they were not out of bounds. All occurrences of $P$ with start-indices within $[p_2..p_1]$, i.e., the leftmost $y$, correspond to leaves found by traversal. From these occurrences, we can derive the occurrences with start-indices within $[p_1..q_1]$, as $W[p_1..q_1]$ consists simply of further repetitions of $y$.
  • Figure 5: The subtrees starting with $\mathtt{a}$ in the suffix trees of $W=\mathtt{axazaz}$ and $W'=\mathtt{xazaz}$. The diamond shape represents the active point. While the active point $\mathit{lrs}=\mathit{lrs}'=\mathtt{az}$ remains unchanged, the locus representation may need to be updated, as it was on the edge $w \rightsquigarrow y$ before deletion and on the edge $x \rightsquigarrow y$ after.
  • ...and 22 more figures

Theorems & Definitions (19)

  • Lemma 1
  • proof
  • Lemma 2
  • proof
  • Lemma 3
  • proof
  • Lemma 4
  • proof
  • Lemma 5
  • Lemma 6
  • ...and 9 more