R-enum Revisited: Speedup and Extension for Context-Sensitive Repeats and Net Frequencies

Kotaro Kimura; Tomohiro I

TL;DR

The paper speeds up r-enum, an algorithm that enumerates characteristic substrings from the run-length encoded Burrows-Wheeler transform (RLBWT), so that it runs in $O(n)$ time plus time linear in the output size, using $O(r)$ space, by means of a move data structure. It extends the framework to near-supermaximal repeats (NSMRs), supermaximal repeats (SMRs), and the context diversity of every maximal repeat. It also introduces an $O(r)$-space data structure that answers net-frequency (NF) queries, proves an upper bound of $2r$ on both the total number of net occurrences and the number of minimal unique substrings (MUSs), and shows how to produce the sorted lists of net occurrences and MUSs in $O(n)$ time and $O(r)$ space. Together, these results give space-efficient tools for repeat discovery in large strings and genome data, with applications in bioinformatics and text analysis.

Abstract

Nishimoto and Tabei [CPM, 2021] proposed r-enum, an algorithm to enumerate various characteristic substrings, including maximal repeats, in a string $T$ of length $n$ in $O(r)$ words of compressed working space, where $r \le n$ is the number of runs in the Burrows-Wheeler transform (BWT) of $T$. Given the run-length encoded BWT (RLBWT) of $T$, r-enum runs in $O(n\log\log_{w}(n/r))$ time in addition to the time linear in the number of output strings, where $w=\Theta(\log n)$ is the word size. In this paper, we improve the $O(n\log\log_{w}(n/r))$ term to $O(n)$. We also extend r-enum to compute other context-sensitive repeats, such as near-supermaximal repeats (NSMRs) and supermaximal repeats, as well as the context diversity of every maximal repeat, within the same complexity bounds. Furthermore, we study the occurrences that witness NSMRs, which have recently attracted attention under the name of net occurrences: an occurrence of a repeat is called a net occurrence if it is not covered by another repeat, and the net frequency of a repeat is the number of its net occurrences. With this terminology, an NSMR is defined to be a repeat with a positive net frequency. Given the RLBWT of $T$, we show how to compute the set $S^{nsmr}$ of all NSMRs in $T$, together with their net frequencies and net occurrences, in $O(n)$ time and $O(r)$ space. We also show that an $O(r)$-space data structure can be built from the RLBWT to support queries that compute the net frequency of any query pattern $P$ in $O(|P|)$ time. The data structure is built in $O(r)$ space and in $O(n)$ time with high probability, or in deterministic $O(n+|S^{nsmr}|\log\log\min(\sigma,|S^{nsmr}|))$ time, where $\sigma\le r$ is the alphabet size of $T$. To achieve this, we prove that the total number of net occurrences is less than $2r$. We also obtain a new upper bound of $2r$ on the number of minimal unique substrings in $T$, which may be of independent interest.
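To make the quantities in the abstract concrete, the following brute-force sketch computes $r$ and the net frequencies of a small string. It is not the paper's algorithm: it uses quadratic-or-worse time and space and is meant only as a reference for tiny inputs. The function names `bwt_runs` and `net_frequencies` are hypothetical. The sketch also assumes a `$` sentinel for the BWT, and it reads "not covered by another repeat" as: the occurrence is of a repeated string, and extending it by one character to the left or to the right yields a unique string, with a missing extension at a string boundary counted as unique.

```python
from collections import Counter


def bwt_runs(text, sentinel="$"):
    """Return (bwt, r) for text + sentinel, built naively from sorted rotations.

    Assumption: the sentinel does not occur in text and sorts before
    every character of text.
    """
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    bwt = "".join(rot[-1] for rot in rotations)
    # r is the number of maximal runs of equal characters in the BWT.
    r = 1 + sum(1 for a, b in zip(bwt, bwt[1:]) if a != b)
    return bwt, r


def net_frequencies(text):
    """Map each repeat with positive net frequency (each NSMR) to that frequency."""
    n = len(text)
    # Frequency of every substring (brute force, for illustration only).
    freq = Counter(text[i:j] for i in range(n) for j in range(i + 1, n + 1))
    nf = Counter()
    for i in range(n):
        for j in range(i + 1, n + 1):
            if freq[text[i:j]] < 2:
                break  # every longer substring starting at i is unique too
            # Boundary convention (assumed): a missing extension counts as unique.
            left_unique = i == 0 or freq[text[i - 1:j]] == 1
            right_unique = j == n or freq[text[i:j + 1]] == 1
            if left_unique and right_unique:
                nf[text[i:j]] += 1
    return nf


if __name__ == "__main__":
    bwt, r = bwt_runs("banana")
    nf = net_frequencies("banana")
    print(bwt, r, dict(nf))
```

For `banana`, the BWT of `banana$` is `annb$aa`, so $r = 5$, and the only NSMR is `ana` with net frequency 2 (its occurrences at positions 2 and 4, 1-indexed). The total of 2 net occurrences is consistent with the stated bound of fewer than $2r = 10$.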
