R-enum Revisited: Speedup and Extension for Context-Sensitive Repeats and Net Frequencies
Kotaro Kimura, Tomohiro I
TL;DR
The paper speeds up r-enum, an algorithm that enumerates context-sensitive repeats from the run-length encoded Burrows-Wheeler transform (RLBWT) of a string: using a move data structure on the RLBWT, it runs in $O(n)$ time, plus time linear in the output size, and $O(r)$ space. It extends the framework to near-supermaximal repeats (NSMRs), supermaximal repeats (SMRs), and the context diversity of every maximal repeat. It also introduces an $O(r)$-space data structure for net frequency (NF) queries, proves an upper bound of $2r$ on the number of net occurrences and on the number of minimal unique substrings (MUSs), and shows how to produce the sorted list of net occurrences and MUSs in $O(n)$ time and $O(r)$ space. Together, these results give space-efficient tools for finding repeats in large, highly repetitive strings such as genome collections.
Abstract
Nishimoto and Tabei [CPM, 2021] proposed r-enum, an algorithm to enumerate various characteristic substrings, including maximal repeats, in a string $T$ of length $n$ in $O(r)$ words of compressed working space, where $r \le n$ is the number of runs in the Burrows-Wheeler transform (BWT) of $T$. Given the run-length encoded BWT (RLBWT) of $T$, r-enum runs in $O(n\log\log_{w}(n/r))$ time in addition to the time linear in the number of output strings, where $w=\Theta(\log n)$ is the word size. In this paper, we improve the $O(n\log\log_{w}(n/r))$ term to $O(n)$. We also extend r-enum to compute other context-sensitive repeats, such as near-supermaximal repeats (NSMRs) and supermaximal repeats, as well as the context diversity of every maximal repeat, within the same time and space bounds. Furthermore, we study the occurrences that witness NSMRs, which have recently attracted attention under the name of net occurrences: an occurrence of a repeat is called a net occurrence if it is not covered by another repeat, and the net frequency of a repeat is the number of its net occurrences. With this terminology, an NSMR is defined to be a repeat with a positive net frequency. Given the RLBWT of $T$, we show how to compute the set $S^{nsmr}$ of all NSMRs in $T$ together with their net frequencies and net occurrences in $O(n)$ time and $O(r)$ space. We also show that an $O(r)$-space data structure can be built from the RLBWT to support queries that compute the net frequency of any query pattern $P$ in $O(|P|)$ time. The data structure is built in $O(r)$ space and in $O(n)$ time with high probability, or in deterministic $O(n+|S^{nsmr}|\log\log\min(\sigma,|S^{nsmr}|))$ time, where $\sigma\le r$ is the alphabet size of $T$. To achieve this, we prove that the total number of net occurrences is less than $2r$. We also obtain a new upper bound of $2r$ on the number of minimal unique substrings in $T$, which may be of independent interest.
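To make the definition of net occurrences concrete, the following is a minimal brute-force sketch in Python. It is not the RLBWT-based algorithm of the paper and runs in polynomial time rather than $O(n)$. It assumes one reading of "not covered by another repeat": an occurrence of a repeat is a net occurrence when both its one-character left extension and its one-character right extension occur only once in $T$. It also assumes that an extension running past either end of $T$ counts as unique. The function names `count_occurrences`, `net_occurrences`, and `net_frequency` are hypothetical and chosen only for this illustration.

```python
def count_occurrences(text: str, pattern: str) -> int:
    """Count the (possibly overlapping) occurrences of pattern in text."""
    return sum(
        text.startswith(pattern, i)
        for i in range(len(text) - len(pattern) + 1)
    )


def net_occurrences(text: str, pattern: str) -> list[int]:
    """Return the start positions of the net occurrences of pattern (brute force)."""
    # Only repeats (patterns with at least two occurrences) can have net occurrences.
    if not pattern or count_occurrences(text, pattern) < 2:
        return []
    m = len(pattern)
    result = []
    for i in range(len(text) - m + 1):
        if not text.startswith(pattern, i):
            continue
        # Assumption: an extension that would leave the text is treated as unique.
        left_unique = i == 0 or count_occurrences(text, text[i - 1:i + m]) == 1
        right_unique = (
            i + m == len(text)
            or count_occurrences(text, text[i:i + m + 1]) == 1
        )
        if left_unique and right_unique:
            result.append(i)
    return result


def net_frequency(text: str, pattern: str) -> int:
    """Net frequency = number of net occurrences; an NSMR has net frequency > 0."""
    return len(net_occurrences(text, pattern))
```

For example, under these assumptions, in `"abcabc"` the repeat `"abc"` has net occurrences at positions 0 and 3, so its net frequency is 2 and it is an NSMR. The repeat `"ab"` has net frequency 0, because each of its occurrences extends to the right into the repeat `"abc"`.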
