Efficient Algorithms to Compute Closed Substrings

Samkith K Jain, Neerja Mhaskar

TL;DR

This work shows how to enumerate all closed substrings and maximal closed substrings (MCSs) of a string of length n in O(n log n) time and O(n log n) space. It introduces a compact representation C(w) for closed substrings via the MRC array, and provides two O(n log n) algorithms to compute it: an SA/LCP-based method and a Crochemore-equivalence-class method. It also derives exact MCS counts in Fibonacci words, with asymptotic growth M(f_n) ≈ 1.382 F_n, and reports experimental comparisons showing the trade-offs between the two approaches across string classes. The results offer practical, scalable tools for closed-factor analysis and open avenues for extending the framework to other word families and to parallel implementations.

Abstract

A closed string $u$ is either of length one or contains a border that occurs only as a prefix and as a suffix of $u$ and nowhere else within $u$. In this paper, we present fast $\mathcal{O}(n\log n)$ time algorithms to compute all $\mathcal{O}(n^2)$ closed substrings by introducing a compact representation for all closed substrings of a string $w[1..n]$, using only $\mathcal{O}(n \log n)$ space. These simple and space-efficient algorithms also compute maximal closed substrings (MCSs). Furthermore, we compare the performance of these algorithms and identify classes of strings where each performs best. Finally, we derive the exact number of MCSs ($M(f_n)$) in a Fibonacci word $f_n$, for $n \geq 5$, and show that it is $\approx \left(1 + \frac{1}{\varphi^2}\right) F_n \approx 1.382 F_n$, where $\varphi$ is the golden ratio.
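
The definition of a closed string above can be checked directly by brute force. The following is a minimal sketch for illustration only, not one of the paper's $\mathcal{O}(n \log n)$ algorithms. It relies on the standard observation that a border occurring only as a prefix and a suffix must be the longest border, so it suffices to test whether the longest border occurs exactly twice. The function names (`longest_border_length`, `is_closed`, `count_closed_substrings`) are hypothetical, and treating the empty string as not closed is an assumption, since the definition above does not cover it.

```python
def longest_border_length(u: str) -> int:
    """Length of the longest proper border of u, via the KMP failure function."""
    if not u:
        return 0
    fail = [0] * len(u)
    k = 0
    for i in range(1, len(u)):
        while k > 0 and u[i] != u[k]:
            k = fail[k - 1]
        if u[i] == u[k]:
            k += 1
        fail[i] = k
    return fail[-1]


def count_occurrences(u: str, b: str) -> int:
    """Number of (possibly overlapping) occurrences of b in u."""
    count = 0
    start = u.find(b)
    while start != -1:
        count += 1
        start = u.find(b, start + 1)
    return count


def is_closed(u: str) -> bool:
    """True if u has length one, or its longest border occurs exactly twice."""
    if len(u) == 0:
        return False  # assumption: the empty string is treated as not closed
    if len(u) == 1:
        return True
    b = longest_border_length(u)
    if b == 0:
        return False  # no border at all
    return count_occurrences(u, u[:b]) == 2


def count_closed_substrings(w: str) -> int:
    """Brute-force count of closed substring occurrences of w (cubic time or worse)."""
    n = len(w)
    return sum(is_closed(w[i:j]) for i in range(n) for j in range(i + 1, n + 1))


if __name__ == "__main__":
    print(is_closed("abab"))               # True: border "ab" occurs only as prefix and suffix
    print(is_closed("abaa"))               # False: border "a" also occurs at an internal position
    print(count_closed_substrings("aba"))  # 4: "a", "b", "a", "aba"
```

A checker like this is only practical for short inputs, but it can serve as a reference when testing a faster implementation on small strings.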

Paper Structure

This paper contains 12 sections, 22 theorems, 12 equations, 8 figures, 2 tables, 4 algorithms.

Key Result

Lemma 1

The CMR Algorithm correctly computes all the maximal repetitions in a string $w[1..n]$ in $\mathcal{O}(n \log n)$ time.

Figures (8)

  • Figure 1: Illustration of the second case in the proof of Lemma \ref{lem:each_prefix_is_closed}, showing that $|p_j|$ can be computed from $|b_j|$, $|b_{j-1}|$ and $|r_j|$, such that $p_j = r_j$ or $p_j \in E_{r_j}$. Note that when $p_j = r_j$, $p_j' = \varepsilon$.
  • Figure 2: A simple linear scan to add all maximal right-closed substrings of length $1$ to the $\mathcal{MRC}$ array.
  • Figure 3: Equivalence classes across successive levels in the CMR Algorithm for the string $w[1..11] = mississippi$, illustrating the identification of maximal right-closed substrings from level transitions.
  • Figure 4: Execution time comparison between Algorithm \ref{alg:mrc} ($\mathsf{SA}$ and $\mathsf{LCP}$ based) and Algorithm \ref{alg:mrc_crochemore} (CMR based) on Fibonacci and Tribonacci words.
  • Figure 5: Execution time comparison between Algorithm \ref{alg:mrc} ($\mathsf{SA}$ and $\mathsf{LCP}$ based) and Algorithm \ref{alg:mrc_crochemore} (CMR based) on aperiodic Thue-Morse words and digits of $\pi$.
  • ...and 3 more figures

Theorems & Definitions (33)

  • Lemma 1: Theorem 7 in CROCHEMORE1981244
  • Lemma 2
  • proof
  • Lemma 3
  • proof
  • Lemma 4: Theorem 1 in kosolobov2024closedrepeats
  • Theorem 1
  • Definition 1: $\mathcal{MRC}$ Array
  • Lemma 5: Jakub2015
  • Theorem 2
  • ...and 23 more