Multiperiodic Processes: Ergodic Sources with a Sublinear Entropy

Łukasz Dębowski

TL;DR

Multiperiodic processes are a rigorously tractable toy model: they are ergodic but not mixing, have a vanishing entropy rate under a mild condition, and satisfy Hilberg's law, with Zipf-like type frequencies built in through randomly shifted deterministic sequences. The Infinite Clock algorithm generates these multiperiodic sequences, and the author develops statistics (relative frequencies, waiting times, number of observed types) and information-theoretic bounds (seed estimation, block entropy) to characterize the model. Two regimes, constant periods and linear periods, show how period growth controls type-token growth and entropy properties; under moment conditions, the model realizes Hilberg-type power laws with tunable exponents. The work connects to broader themes in linguistic statistics and neural scaling, offering a transparent framework that links Zipf's law with the long-range dependencies observed in language data.

Abstract

We construct multiperiodic processes, a simple example of stationary ergodic (but not mixing) processes over the natural numbers that have a vanishing entropy rate under a mild condition. Multiperiodic processes are supported on randomly shifted deterministic sequences called multiperiodic sequences, which can be efficiently generated using an algorithm called the Infinite Clock. Under a suitable parameterization, the relative frequencies of particular numbers in multiperiodic sequences follow Zipf's law. In exactly the same setting, the corresponding multiperiodic processes exhibit an asymptotic power-law growth of block entropy, called Hilberg's law. Hilberg's law is believed to hold for statistical language models in particular.
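To make the idea of a clock-driven generator concrete, here is a minimal Python sketch of one possible reading of the Infinite Clock. It rests on several assumptions that this summary does not confirm: the sequence is one-sided ($t = 1, 2, \dots$), level $k$ is consulted only at times when no level $j < k$ has fired, and level $k$ fires on its $\sigma_k$-th visit and every $\pi_k$ visits thereafter. The names `infinite_clock`, `period`, `seed`, and `max_level` are hypothetical.

```python
from itertools import islice


def infinite_clock(period, seed=lambda k: 1, max_level=10_000):
    """Yield k_1, k_2, ... of a one-sided multiperiodic sequence.

    Assumed convention (not taken from the paper): level k fires on its
    seed(k)-th visit and then on every period(k)-th visit after that.
    """
    countdown = {}  # level k -> visits remaining until level k fires
    while True:
        for k in range(1, max_level + 1):
            if k not in countdown:
                countdown[k] = seed(k)  # start level k lazily on first visit
            countdown[k] -= 1
            if countdown[k] == 0:
                countdown[k] = period(k)  # rewind the hand of clock k
                yield k
                break
        else:
            # Can happen only if no level up to max_level fires at this step,
            # e.g. when every seed exceeds 1 at the very first time step.
            raise RuntimeError("no level fired; increase max_level")


if __name__ == "__main__":
    # Constant periods pi_k = 2 with seeds sigma_k = 1.
    print(list(islice(infinite_clock(lambda k: 2), 16)))
```

Under these assumptions, constant periods $\pi_k = 2$ with seeds $\sigma_k = 1$ give the ruler sequence, whose first terms agree with OEIS A001511. Storing the counters lazily in a dictionary keeps memory proportional to the number of types observed so far, not to a preset alphabet size.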

Paper Structure (15 sections, 16 theorems, 69 equations, 1 figure, 1 algorithm)

Key Result

Theorem 1

For the multiperiodic sequence $(k_t)_{t\in\mathbb{Z}}$ with periods $(\pi_k)_{k\in\mathbb{N}}$ and seeds $(\sigma_k)_{k\in\mathbb{N}}$, the reversed sequence $(k_{1-t})_{t\in\mathbb{Z}}$ is the multiperiodic sequence with periods $(\pi_k)_{k\in\mathbb{N}}$ and seeds $(\sigma^R_k)_{k\in\mathbb{N}}$ defined as $\sigma^R_k:=\pi_k-\sigma_k+1$.
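A quick consistency check on the seed map, using only the formula in the theorem: since $t \mapsto 1-t$ is an involution, reversing twice must return the original seeds, and indeed

$$(\sigma^R)^R_k = \pi_k - \sigma^R_k + 1 = \pi_k - (\pi_k - \sigma_k + 1) + 1 = \sigma_k .$$

The map also sends $\{1,\dots,\pi_k\}$ to itself. For instance, with $\pi_k = 2$ and $\sigma_k = 1$ the reversed seeds are $\sigma^R_k = 2 - 1 + 1 = 2$.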

Figures (1)

  • Figure 1: The number of types $V_t$ as a function of sequence length $t$ for periods $\pi_k=1+\left\lceil ck \right\rceil$ and seeds $\sigma_k=1$. The lines are theoretical predictions $V_t\sim t^{c/(c+1)}$, where the proportionality constant is chosen by least squares.
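The experiment behind a plot of this kind can be sketched as follows. This is an illustrative reconstruction, not the author's code: it assumes the same one-sided clock convention as a simple counter-based generator (level $k$ fires on its first visit when $\sigma_k = 1$ and every $\pi_k$ visits thereafter), and it fits the proportionality constant by least squares on the linear scale, which the caption does not specify. The function names `type_counts` and `fit_constant` are hypothetical.

```python
import math


def type_counts(c, checkpoints):
    """Return V_t, the number of distinct values in k_1..k_t, for each
    checkpoint t (results are in increasing order of t).

    Periods are pi_k = 1 + ceil(c * k) and seeds are sigma_k = 1, under an
    assumed one-sided clock convention.
    """
    countdown = []  # countdown[k - 1]: visits left until level k fires
    seen = set()
    out = []
    wanted = sorted(checkpoints)
    i = 0
    for t in range(1, wanted[-1] + 1):
        k = 0
        while True:
            k += 1
            if k > len(countdown):
                countdown.append(1)  # seed 1: a new level fires on first visit
            countdown[k - 1] -= 1
            if countdown[k - 1] == 0:
                countdown[k - 1] = 1 + math.ceil(c * k)  # period pi_k
                break
        seen.add(k)
        while i < len(wanted) and wanted[i] == t:
            out.append(len(seen))
            i += 1
    return out


def fit_constant(ts, vs, alpha):
    """Least-squares constant a in V_t ~ a * t**alpha (linear scale)."""
    num = sum(v * t**alpha for t, v in zip(ts, vs))
    den = sum(t ** (2 * alpha) for t in ts)
    return num / den


if __name__ == "__main__":
    c = 1.0
    ts = [2**j for j in range(4, 15)]
    vs = type_counts(c, ts)
    alpha = c / (c + 1)
    a = fit_constant(ts, vs, alpha)
    for t, v in zip(ts, vs):
        print(t, v, round(a * t**alpha, 1))
```

Printing the observed $V_t$ next to the fitted $a\,t^{c/(c+1)}$ gives a rough numerical counterpart to the comparison in the figure; changing `c` changes the predicted exponent.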

Theorems & Definitions (37)

  • Definition 1: multiperiodic sequence
  • Example 1: seemingly A001511 in OEIS23
  • Example 2: seemingly A028920 in OEIS23
  • Theorem 1
  • proof
  • Theorem 2
  • proof
  • Definition 2: multiperiodic process
  • Theorem 3
  • proof
  • ...and 27 more