Space-Efficient Online Computation of String Net Occurrences

Takuya Mieno, Shunsuke Inenaga

TL;DR

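This paper proposes two space-efficient alternatives to suffix-tree-based online computation of extended net occurrences: (1) a sliding-window algorithm with $O(d)$ working space that reports $\mathsf{ENO}(T[i-d+1..i])$ in optimal $O(\#\mathsf{ENO}(T[i-d+1..i]))$ time for each sliding window of size $d$ in $T$, and (2) a CDAWG-based online algorithm with $O(e)$ working space that reports $\mathsf{ENO}(T[1..i])$ in optimal $O(\#\mathsf{ENO}(T[1..i]))$ time for each prefix $T[1..i]$ of $T$, where $e < 2n$ is the number of edges in the CDAWG for $T$.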

Abstract

A substring $u$ of a string $T$ is said to be a repeat if $u$ occurs at least twice in $T$. An occurrence $[i..j]$ of a repeat $u$ in $T$ is said to be a net occurrence if each of the substrings $aub = T[i-1..j+1]$, $au = T[i-1..j]$, and $ub = T[i..j+1]$ occurs exactly once in $T$. The occurrence $[i-1..j+1]$ of $aub$ is said to be an extended net occurrence of $u$. Let $T$ be an input string of length $n$ over an alphabet of size $σ$, and let $\mathsf{ENO}(T)$ denote the set of extended net occurrences of repeats in $T$. Guo et al. [SPIRE 2024] presented an online algorithm that can report $\mathsf{ENO}(T[1..i])$ in $O(nσ^2)$ time for each prefix $T[1..i]$ of $T$. Very recently, Inenaga [arXiv 2024] gave a faster online algorithm that can report $\mathsf{ENO}(T[1..i])$ in optimal $O(\#\mathsf{ENO}(T[1..i]))$ time for each prefix $T[1..i]$ of $T$, where $\#S$ denotes the cardinality of a set $S$. The data structures of both of these algorithms can be maintained in $O(n \log σ)$ time and occupy $O(n)$ space, where the $O(n)$-space requirement comes from the suffix tree data structure. In this paper, we propose the following two space-efficient alternatives: (1) A sliding-window algorithm of $O(d)$ working space that can report $\mathsf{ENO}(T[i-d+1..i])$ in optimal $O(\#\mathsf{ENO}(T[i-d+1..i]))$ time for each sliding window $T[i-d+1..i]$ of size $d$ in $T$. (2) A CDAWG-based online algorithm of $O(e)$ working space that can report $\mathsf{ENO}(T[1..i])$ in optimal $O(\#\mathsf{ENO}(T[1..i]))$ time for each prefix $T[1..i]$ of $T$, where $e < 2n$ is the number of edges in the CDAWG for $T$. All of our proposed data structures can be maintained in $O(n \log σ)$ time for the input online string $T$. We also discuss how the extended net occurrences of repeats in $T$ can be fully characterized in terms of the minimal unique substrings (MUSs) in $T$.
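To make the definition of an extended net occurrence concrete, the following is a minimal brute-force sketch in Python. It is not the sliding-window or CDAWG-based algorithm of the paper and makes no attempt at their time or space bounds; it only enumerates $\mathsf{ENO}(T)$ directly from the definition. The function names `count_occurrences`, `extended_net_occurrences`, and `sliding_window_eno` are hypothetical, and the sketch assumes that $u$ is non-empty and that both neighboring positions $i-1$ and $j+1$ lie inside $T$ (no sentinel characters are added).

```python
from typing import Iterator


def count_occurrences(text: str, pattern: str) -> int:
    """Count (possibly overlapping) occurrences of pattern in text."""
    return sum(
        1
        for k in range(len(text) - len(pattern) + 1)
        if text.startswith(pattern, k)
    )


def extended_net_occurrences(text: str) -> list[tuple[int, int]]:
    """Return ENO(text) as 1-based inclusive intervals (i-1, j+1).

    Assumptions of this sketch: u = T[i..j] is non-empty, and both
    T[i-1] and T[j+1] exist, so 2 <= i <= j <= n-1.
    """
    n = len(text)
    result = []
    for i in range(2, n):            # 1-based start of u
        for j in range(i, n):        # 1-based end of u, j <= n-1
            u = text[i - 1:j]        # T[i..j]
            au = text[i - 2:j]       # T[i-1..j]
            ub = text[i - 1:j + 1]   # T[i..j+1]
            # If au is unique then aub is unique as well, so checking
            # au and ub covers all three uniqueness conditions.
            if (
                count_occurrences(text, u) >= 2
                and count_occurrences(text, au) == 1
                and count_occurrences(text, ub) == 1
            ):
                result.append((i - 1, j + 1))
    return result


def sliding_window_eno(text: str, d: int) -> Iterator[tuple[int, list[tuple[int, int]]]]:
    """Yield (1-based window start, ENO of that window) for each window of size d.

    Intervals are relative to the window, not to the whole text.
    """
    for start in range(len(text) - d + 1):
        yield start + 1, extended_net_occurrences(text[start:start + d])


if __name__ == "__main__":
    print(extended_net_occurrences("xabyabz"))
    for window_start, eno in sliding_window_eno("xabyabz", 6):
        print(window_start, eno)
```

For `"xabyabz"`, the repeat `ab` occurs at $[2..3]$ and $[5..6]$; both `xab` and `aby` are unique, as are `yab` and `abz`, so the sketch reports the intervals $(1, 4)$ and $(4, 7)$. The single-character repeats `a` and `b` are rejected because one of their one-character extensions (`ab`) occurs twice.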

Paper Structure

This paper contains 9 sections, 19 theorems, 2 figures, 1 algorithm.

Key Result

Lemma 1

For any string $T$, $\mathsf{e}'(T) \leq \mathsf{e}(T)$.

Figures (2)

  • Figure 1: The implicit suffix tree $\mathsf{STree}'(T)$, the explicit suffix tree $\mathsf{STree}(T)$, the implicit CDAWG $\mathsf{CDAWG}'(T)$, and the explicit CDAWG $\mathsf{CDAWG}(T)$ for string $T = \mathtt{abbbabbabbab}$. The broken arrows represent suffix links. The white and gray stars represent the loci of the longest repeating suffix $\mathtt{bbabbab}$ and the shortest quasi-unique suffix $\mathtt{abbab}$ of $T$, respectively.
  • Figure 5: Illustration for Lemma \ref{lem:ENO_MUS} and Lemma \ref{lem:MUS_ENO-new}.

Theorems & Definitions (19)

  • Lemma 1
  • Lemma 3
  • Lemma 4
  • Lemma 5
  • Lemma 6
  • Lemma 7
  • Lemma 8
  • Theorem 9
  • Lemma 10
  • Theorem 11
  • ...and 9 more