Efficient Computation of Periods and Covers Using Sampling

Thierry Lecroq, Francesco Pio Marino

TL;DR

This paper introduces a novel application of Characters-Distance-Sampling (CDS) to compute fundamental string regularities, specifically the period and the shortest cover. It develops CDS-based algorithms that operate directly on the CDS representation with a single pivot (the first character), preserving linear-time behavior while enabling substantial speedups over classical methods. Empirically, the CDS-based approaches achieve speedups in the ranges $38\%$--$43\%$ for period computation and $63\%$--$72\%$ for cover detection, highlighting the practical efficiency and potential of CDS-based string analysis for applications in compression, computational biology, and pattern recognition. The results suggest broader applicability of CDS representations for efficient regularity detection in strings.

Abstract

Identifying regularities in strings, such as \emph{periods} and \emph{covers}, is crucial for applications in text compression, computational biology, and pattern recognition. \emph{Characters-Distance-Sampling} (\texttt{CDS}) is an efficient technique that encodes a string by storing distances between selected pivot characters, accelerating string-processing tasks. We apply \texttt{CDS} to compute periods and shortest covers, selecting only the first character as the pivot. This strategy yields optimized computations, achieving speedups of $38\%$--$43\%$ for period computation and $63\%$--$72\%$ for cover detection. These results demonstrate the potential of \texttt{CDS}-based representations for efficient string analysis and broader applications.
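
As a rough illustration of the encoding described above, the following Python sketch stores the gaps between consecutive occurrences of the first character. The function name `cds_encode`, the returned pair, and the convention of reporting the number of trailing non-pivot characters separately as `k` are assumptions made for this example (the convention is inferred from the figure captions); this is not the authors' implementation.

```python
def cds_encode(x):
    """Hypothetical helper: CDS-style encoding of x with pivot x[0].

    Returns (distances, k) where distances[i] is the gap between the
    i-th and (i+1)-th occurrences of the pivot, and k is the number of
    characters that follow the last pivot occurrence.
    """
    if not x:
        return [], 0
    pivot = x[0]
    # Positions of every occurrence of the pivot character.
    positions = [i for i, c in enumerate(x) if c == pivot]
    # Gaps between consecutive pivot positions.
    distances = [q - p for p, q in zip(positions, positions[1:])]
    # Assumption: the tail after the last pivot is kept apart as k.
    k = len(x) - 1 - positions[-1]
    return distances, k


print(cds_encode("abaababaaba"))  # ([2, 1, 2, 2, 1, 2], 0)
print(cds_encode("abbababbabb"))  # ([3, 2, 3], 2)
```

The two strings are the ones used in the figure captions, so the printed distances can be compared against the values of $\bar{x}$ quoted there. Note that the distances alone do not record which non-pivot characters occur between pivots.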

Paper Structure (6 sections, 3 theorems, 10 equations, 3 figures)

Key Result

Lemma

Let $x[\delta(i)] = a$ for $0 \leq i \leq \bar{m}-1$. Then:

Figures (3)

  • Figure 1: Border array of $x=\hbox{\tt abaababaaba}$ of length $11$.
  • Figure 2: Border array of the CDS representation of $x=\hbox{\tt abaababaaba}$ of length $11$ with pivot a. Then $\textit{per}(\bar{x}) = 6-\textit{border}_{\bar{x}}[6]=3$, thus $\textit{per}(x) = \bar{x}[0]+\bar{x}[1]+\bar{x}[2]=2+1+2=5$.
  • Figure 3: Border array of the CDS representation of $x=\hbox{\tt abbababbabb}=\hbox{\tt abbababba}\hbox{\tt b}^2$ of length $11$ with pivot a, thus $k = 2$. $\textit{border}_{\bar{x}}[3]=1$, but $\bar{x}[\textit{border}_{\bar{x}}[3]]=\bar{x}[1]=2 \leq k=2$, so this border is rejected. Next, $\textit{border}_{\bar{x}}[1]=0$ and $\bar{x}[\textit{border}_{\bar{x}}[1]]=\bar{x}[0]=3 > k=2$. Since we use the border of $\bar{x}$ of length $\textit{border}_{\bar{x}}[1]=0$, the corresponding period of $\bar{x}$ is $|\bar{x}|-\textit{border}_{\bar{x}}[1]=3$, and summing the first $3$ elements of $\bar{x}$ gives $\textit{per}(x) = \bar{x}[0]+\bar{x}[1]+\bar{x}[2]=3+2+3=8$.
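
The reasoning in the captions of Figures 2 and 3 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the paper's algorithm: the names `cds_encode`, `border_array` and `period_via_cds` are hypothetical, the encoding convention (gaps between pivots plus a separate tail length `k`) is inferred from the captions, and the final substring comparison is an extra safeguard added here so that the sketch stays correct when non-pivot characters differ. That safeguard may cost more than linear time.

```python
def cds_encode(x):
    """Hypothetical helper: gaps between occurrences of x[0], plus tail length k."""
    if not x:
        return [], 0
    positions = [i for i, c in enumerate(x) if c == x[0]]
    distances = [q - p for p, q in zip(positions, positions[1:])]
    return distances, len(x) - 1 - positions[-1]


def border_array(s):
    """border[i] = length of the longest border of s[:i]; border[0] = -1."""
    border = [0] * (len(s) + 1)
    border[0] = -1
    t = -1
    for i in range(len(s)):
        # Fall back along shorter borders until s[t] extends one of them.
        while t >= 0 and s[t] != s[i]:
            t = border[t]
        t += 1
        border[i + 1] = t
    return border


def period_via_cds(x):
    """Smallest period of x, found by walking the borders of its CDS encoding."""
    n = len(x)
    if n == 0:
        return 0
    dist, k = cds_encode(x)
    m = len(dist)
    border = border_array(dist)
    b = border[m]  # longest border of the encoding (-1 if it is empty)
    while b >= 0:
        # As in Figure 3: the border is usable only if dist[b] > k,
        # i.e. the k tail characters fit before the next pivot.
        if dist[b] > k:
            p = sum(dist[:m - b])  # candidate period of x
            # Assumption of this sketch: verify the characters directly,
            # since the distances do not encode non-pivot characters.
            if x[p:] == x[:n - p]:
                return p
        b = border[b]
    return n  # no border found: the period is the full length


print(period_via_cds("abaababaaba"))  # 5
print(period_via_cds("abbababbabb"))  # 8
```

On the two example strings the sketch reproduces the values $\textit{per}(x)=5$ and $\textit{per}(x)=8$ derived in the captions. Over a two-letter alphabet the distances determine the string, so the verification step never rejects a candidate there.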

Theorems & Definitions (6)

  • Lemma
  • Proof
  • Lemma
  • Proof
  • Lemma
  • Proof