R-enum Revisited: Speedup and Extension for Context-Sensitive Repeats and Net Frequencies

Kotaro Kimura, Tomohiro I

TL;DR

The paper addresses efficient enumeration of context-sensitive repeats and net occurrences in strings. It improves r-enum to run in $O(n)$ time, plus time linear in the number of output strings, and $O(r)$ space by using a move data structure on the run-length encoded Burrows-Wheeler transform. It extends the framework to near-supermaximal repeats (NSMRs), supermaximal repeats (SMRs), and context diversity. It also introduces an $O(r)$-space data structure for net frequency (NF) queries, proves that the total number of net occurrences is less than $2r$ and that the number of minimal unique substrings (MUSs) is at most $2r$, and shows how to produce the sorted list of net occurrences and MUSs in $O(n)$ time and $O(r)$ space. Together, these contributions give space-efficient tools for pattern discovery in large, highly repetitive strings such as genome collections.

Abstract

Nishimoto and Tabei [CPM, 2021] proposed r-enum, an algorithm to enumerate various characteristic substrings, including maximal repeats, in a string $T$ of length $n$ in $O(r)$ words of compressed working space, where $r \le n$ is the number of runs in the Burrows-Wheeler transform (BWT) of $T$. Given the run-length encoded BWT (RLBWT) of $T$, r-enum runs in $O(n\log\log_{w}(n/r))$ time in addition to the time linear in the number of output strings, where $w=\Theta(\log n)$ is the word size. In this paper, we improve the $O(n\log\log_{w}(n/r))$ term to $O(n)$. We also extend r-enum to compute other context-sensitive repeats such as near-supermaximal repeats (NSMRs) and supermaximal repeats, and the context diversity for every maximal repeat in the same complexities. Furthermore, we study the occurrences that witness NSMRs, which have recently attracted attention under the name of net occurrences: An occurrence of a repeat is called a net occurrence if it is not covered by another repeat, and the net frequency of a repeat is the number of its net occurrences. With this terminology, an NSMR is defined to be a repeat with a positive net frequency. Given the RLBWT of $T$, we show how to compute the set $\mathcal{S}^{\mathsf{nsmr}}$ of all NSMRs in $T$ together with their net frequency/occurrences in $O(n)$ time and $O(r)$ space. We also show that an $O(r)$-space data structure can be built from the RLBWT to support queries of computing the net frequency of any query pattern $P$ in $O(|P|)$ time. The data structure is built in $O(r)$ space and in $O(n)$ time with high probability or deterministic $O(n+|\mathcal{S}^{\mathsf{nsmr}}|\log\log\min(\sigma,|\mathcal{S}^{\mathsf{nsmr}}|))$ time, where $\sigma\le r$ is the alphabet size of $T$. To achieve this, we prove that the total number of net occurrences is less than $2r$. We also get a new upper bound $2r$ on the number of minimal unique substrings in $T$, which may be of independent interest.
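
The definitions of net occurrence and net frequency can be checked by brute force on a small string. The sketch below is not the paper's algorithm: it takes quadratic time and space and is meant only to illustrate the definitions. It relies on the observation that an occurrence is covered by a longer repeat exactly when its one-character left extension or its one-character right extension is itself a repeat. The function names `net_occurrences` and `bwt_runs` are hypothetical, positions are 0-indexed, and the text is assumed to end with a unique sentinel `$` that sorts below every other character. The input is the running example $T = \mathtt{abcbbcbcabc\$}$ used in the paper's figures.

```python
from collections import Counter, defaultdict


def net_occurrences(text):
    """Brute force: map each repeat to the start positions of its net occurrences."""
    n = len(text)
    # Frequency of every non-empty substring (illustration only, quadratic size).
    freq = Counter(text[i:j] for i in range(n) for j in range(i + 1, n + 1))

    def unique_or_outside(i, j):
        # An extension that runs past either end of the text cannot be a repeat.
        return i < 0 or j > n or freq[text[i:j]] == 1

    result = defaultdict(list)
    for i in range(n):
        for j in range(i + 1, n + 1):
            is_repeat = freq[text[i:j]] >= 2
            # Not covered by a longer repeat: both one-character extensions are unique.
            if is_repeat and unique_or_outside(i - 1, j) and unique_or_outside(i, j + 1):
                result[text[i:j]].append(i)
    return dict(result)


def bwt_runs(text):
    """Number of runs r in the BWT, built naively by sorting suffixes."""
    n = len(text)
    suffix_array = sorted(range(n), key=lambda i: text[i:])
    # text[i - 1] wraps around to the sentinel when i == 0.
    bwt = "".join(text[i - 1] for i in suffix_array)
    return sum(1 for k in range(n) if k == 0 or bwt[k] != bwt[k - 1])


T = "abcbbcbcabc$"
nocc = net_occurrences(T)
net_frequency = {s: len(pos) for s, pos in nocc.items()}
r = bwt_runs(T)
total = sum(net_frequency.values())
print(nocc)       # NSMRs with their net occurrences
print(total, r)   # total number of net occurrences and number of BWT runs
```

On this input the NSMRs are `abc`, `bcb`, and `bc`, with net frequencies 2, 2, and 1, so there are 5 net occurrences in total against $r = 7$ runs, which is consistent with the bound of fewer than $2r$ net occurrences.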

Paper Structure

This paper contains 13 sections, 12 theorems, 2 figures, 2 tables.

Key Result

Lemma 1

Given $S \subseteq [1..u]$ of size $m \le u \le 2^{O(w)}$, we can build an $O(m)$-size dictionary in $O(m)$ time with high probability or deterministic $O(m \log \log m)$ time so that lookup queries can be supported in $O(1)$ worst-case time.

Figures (2)

  • Figure 1: The left figure shows the suffix tree of $T = \mathtt{abcbbcbcabc\$}$ illustrated over sorted suffixes, on which each node $x$ can be represented by $\mathcal{I}(x)$ and $|x|$. A solid box is a node (highlighted for internal nodes), and a dotted box is an implicit node. The right figure shows the Weiner links outgoing from all internal nodes (the Weiner links from leaves are omitted). The Weiner links that point to internal nodes are depicted with solid arrows, and the others with dotted arrows. Observe that there is a Weiner link from a node $x$ to $ax$ for any character $a \in \mathsf{lc}(x) = \{ \mathsf{L}[i] \mid i \in \mathcal{I}(x) \}$, where the case with $a = \$$ is excluded unless $x = \varepsilon$.
  • Figure 2: An illustration of the compacted reversed trie $\mathcal{T}$ for $\mathcal{S}^{\mathsf{nsmr}} = \{ \mathtt{bc}, \mathtt{abc}, \mathtt{bcb} \}$ in our running example $T = \mathtt{abcbbcbcabc\$}$. A node corresponding to an NSMR is highlighted. Storing $\mathsf{NOcc}(\cdot)$ is optional. Note that edge labels are not stored explicitly. For example, the string $\mathtt{bcb}$ on the edge from $\varepsilon$ to $\mathtt{bcb}$ is retrieved using FL-mapping $i_{\mathtt{bcb}} - i_{\varepsilon} = 3$ times from $i_{\mathtt{bcb}} = 7$ when necessary.
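
The relation between a suffix array interval $\mathcal{I}(x)$, the BWT $\mathsf{L}$, and the left-character set $\mathsf{lc}(x)$ described in the Figure 1 caption can be reproduced naively on the running example. This is only an illustrative sketch with a quadratic-time suffix sort, not the RLBWT-based machinery of the paper. The helper names `suffix_array_and_bwt`, `sa_interval`, and `left_chars` are hypothetical, and ranks are 0-indexed here, which may differ from the indexing used in the figures.

```python
def suffix_array_and_bwt(text):
    """Naive suffix array and BWT; assumes text ends with a unique sentinel '$'."""
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)  # text[-1] is the sentinel when i == 0
    return sa, bwt


def sa_interval(text, sa, x):
    """Inclusive range (lo, hi) of suffix ranks whose suffix starts with x, or None."""
    ranks = [k for k, i in enumerate(sa) if text.startswith(x, i)]
    return (ranks[0], ranks[-1]) if ranks else None


def left_chars(text, x):
    """lc(x): characters of L inside I(x); '$' is dropped unless x is empty."""
    sa, bwt = suffix_array_and_bwt(text)
    span = sa_interval(text, sa, x)
    if span is None:
        return set()
    lo, hi = span
    chars = set(bwt[lo:hi + 1])
    if x:
        chars.discard("$")
    return chars


T = "abcbbcbcabc$"
sa, L = suffix_array_and_bwt(T)
print(L)                      # the BWT of T
print(sa_interval(T, sa, "bc"))
print(sorted(left_chars(T, "bc")))
```

For $x = \mathtt{bc}$ the interval covers four suffixes and $\mathsf{lc}(x) = \{\mathtt{a}, \mathtt{b}, \mathtt{c}\}$, so Weiner links lead to $\mathtt{abc}$, $\mathtt{bbc}$, and $\mathtt{cbc}$. Only $\mathtt{abc}$ is a repeat, so under the caption's convention only that link would be drawn with a solid arrow.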

Theorems & Definitions (20)

  • Lemma 1: Willard (2000); Ružić (ICALP 2008)
  • Lemma 2: Belazzougui and Navarro (2015)
  • Lemma 3: Muthukrishnan (SODA 2002); Belazzougui et al. (2020)
  • Example 4
  • Lemma 5
  • Lemma 6
  • Lemma 7
  • proof
  • Example 8
  • Lemma 9
  • ...and 10 more