Incremental computation of the set of period sets

Eric Rivals

TL;DR

This work addresses the problem of enumerating and certifying the set $\Gamma_n$ of all period sets of words of length $n$, whose cardinality $\kappa_n$ grows rapidly with $n$. It introduces an incremental, $O(n)$-space approach that derives $\Gamma_n$ from $\Gamma_{n-1}$ via a parental relation, and couples it with multiple certification strategies, including a constructive binary realization that yields witness words for realized period sets. The author leverages the Guibas–Odlyzko characterizations (forward/backward propagation and predicate $\Xi$) and refines the lifecycle of period sets through the recursive FW limit $\mathrm{rfw}(P)$ and the next extension $e(P)$ to study when sets die or extend. The framework supports practical applications such as assessing the absence probability of words in random texts. It also provides tools and data for exploring the distribution of period sets with respect to basic period and weight, offering a foundation for further theoretical and algorithmic investigations in combinatorics on words and related domains.

Abstract

Overlaps between words are crucial in many areas of computer science, such as code design, stringology, and bioinformatics. A self-overlapping word is characterized by its periods and borders. A period of a word $u$ is the starting position of a suffix of $u$ that is also a prefix of $u$, and such a suffix is called a border. Each word of length, say, $n>0$ has a set of periods, but not all combinations of integers are sets of periods. Computing the period set of a word $u$ takes linear time in the length of $u$. We address the question of computing the set, denoted $\Gamma_n$, of all period sets of words of length $n$. Although period sets have been characterized, there is no formula to compute the cardinality of $\Gamma_n$ (which is exponential in $n$), and the known dynamic programming algorithm to enumerate $\Gamma_n$ suffers from its space complexity. We present an incremental approach to compute $\Gamma_n$ from $\Gamma_{n-1}$, which reduces the space complexity, and then a constructive certification algorithm useful for verification purposes. The incremental approach defines a parental relation between sets in $\Gamma_{n-1}$ and $\Gamma_n$, enabling one to investigate the dynamics of period sets and their intriguing statistical properties. Moreover, the period set of a word $u$ is the key for computing the absence probability of $u$ in random texts. Thus, knowing $\Gamma_n$ is useful to assess the significance of word statistics, such as the number of missing words in a random text.
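To make the definition of a period concrete, here is a minimal Python sketch that computes the period set of a single word in linear time. It uses the classical border (failure) array, which is one standard technique and not necessarily the one the paper relies on; the function name `period_set` is ours. Positions are 0-based, so $0$ is always a period and the period set is a subset of $\{0, \ldots, n-1\}$.

```python
def period_set(u):
    """Return the sorted list of periods of u (0-based starting
    positions of suffixes of u that are also prefixes of u)."""
    n = len(u)
    if n == 0:
        return []
    # border[i] = length of the longest proper border of u[:i + 1]
    border = [0] * n
    k = 0
    for i in range(1, n):
        while k > 0 and u[i] != u[k]:
            k = border[k - 1]
        if u[i] == u[k]:
            k += 1
        border[i] = k
    # The whole word is a border of itself, hence period 0.
    periods = [0]
    # Each non-empty proper border of length b gives the period n - b.
    b = border[n - 1]
    while b > 0:
        periods.append(n - b)
        b = border[b - 1]
    return sorted(periods)


print(period_set("abracadabra"))  # [0, 7, 10]
print(period_set("abab"))         # [0, 2]
```

For `"abracadabra"` the non-empty proper borders are `"abra"` and `"a"`, which start at positions 7 and 10 as suffixes.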

Paper Structure

This paper contains 27 sections, 15 theorems, 1 equation, 4 figures, 5 algorithms.

Key Result

Theorem 4

Let $P$ be a subset of $\{0, 1, \ldots, n-1\}$. The following four statements are equivalent: (1) $P$ is the period set of a binary word of length $n$. (2) $P$ is the period set of a word of length $n$. (3) Zero belongs to $P$ and $P$ satisfies the forward and backward propagation rules. (4) $P$ satis
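The equivalence of statements (1) and (2) means that binary words already realize every period set, so for small $n$ one can obtain $\Gamma_n$ by brute force over the $2^n$ binary words. The sketch below does exactly that; it is an exponential-time illustration for sanity checks, not the incremental algorithm of the paper, and the names `period_set` and `gamma` are hypothetical.

```python
from itertools import product


def period_set(u):
    """Period set of u, checked naively: p is a period iff the
    suffix starting at position p equals the prefix of the same length."""
    n = len(u)
    return frozenset(p for p in range(n) if u[p:] == u[:n - p])


def gamma(n):
    """Map each period set realized by a binary word of length n
    to one witness word (the first one met in enumeration order)."""
    witnesses = {}
    for letters in product("ab", repeat=n):
        u = "".join(letters)
        witnesses.setdefault(period_set(u), u)
    return witnesses


for n in range(1, 6):
    print(n, len(gamma(n)))
# prints 1 1, 2 2, 3 3, 4 4, 5 6 (one pair per line)
```

Each dictionary value is a witness word for its key, in the spirit of the constructive certification mentioned in the abstract, although here the witness is found by exhaustive search.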

Figures (4)

  • Figure 1: Distribution in $\Gamma_{60}$ of the number of period sets by basic period (left) and by weight (right), i.e., for string length $n := 60$. Beyond basic period $30$, the counts decrease smoothly with the basic period. Between basic periods $1$ and $30$, the counts increase to a local maximum when the basic period reaches $\lfloor n/x \rfloor$ for $1 < x \leq 12$ (e.g., basic periods 10, 12, 15, 20, 30). The distribution by weight (right) is limited to weights below $22$; it is unimodal and right-skewed towards low weights.
  • Figure 2: Distribution in $\Gamma_{48}$ of the number of period sets by basic period (left) and by weight (right), i.e., for string length of $n := 48$.
  • Figure 3: Distribution in $\Gamma_{55}$ of the number of period sets by basic period (left) and by weight (right), i.e., for string length of $n := 55$.
  • Figure 4: Distribution in $\Gamma_{59}$ of the number of period sets by basic period (left) and by weight (right), i.e., for string length of $n := 59$.
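The kind of tally plotted in these figures can be reproduced for small $n$ with the sketch below. It rests on two assumptions that the excerpt does not spell out: the basic period of a period set is taken to be its smallest non-zero element (or $n$ when the set is $\{0\}$), and the weight is taken to be its cardinality. The enumeration is brute force over binary words, so it is only practical for small $n$, far below the $n = 60$ of Figure 1; the function names are hypothetical.

```python
from collections import Counter
from itertools import product


def period_set(u):
    """Period set of u via a naive suffix/prefix comparison."""
    n = len(u)
    return frozenset(p for p in range(n) if u[p:] == u[:n - p])


def distributions(n):
    """Count the period sets of length-n binary words by basic period
    and by weight (both definitions are assumptions, see the lead-in)."""
    sets = {period_set("".join(w)) for w in product("ab", repeat=n)}
    # Assumed: basic period = smallest non-zero period, n if none.
    by_basic = Counter(min(P - {0}, default=n) for P in sets)
    # Assumed: weight = number of periods in the set.
    by_weight = Counter(len(P) for P in sets)
    return by_basic, by_weight


by_basic, by_weight = distributions(5)
print(sorted(by_basic.items()))   # [(1, 1), (2, 1), (3, 2), (4, 1), (5, 1)]
print(sorted(by_weight.items()))  # [(1, 1), (2, 2), (3, 2), (5, 1)]
```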

Theorems & Definitions (24)

  • Definition 1: Period/border
  • Definition 2: FPR
  • Definition 3: BPR
  • Theorem 4
  • Theorem 5: Lothaire-ACW
  • Theorem 6: Fine and Wilf
  • Theorem 7
  • Theorem 8
  • Corollary 9
  • Lemma 10
  • ...and 14 more