Statistical Properties of the Rooted-Tree Encoding of $\mathbb{N}$

Pierluigi Contucci, Claudio Giberti, Godwin Osabutey, Cecilia Vernia

TL;DR

This work studies a deterministic text produced by iterated prime-factorisation of natural numbers, encoded as Dyck words representing planar rooted trees. It provides a comprehensive empirical characterization across dictionary growth, orientation, entropy, and compression, revealing sublinear dictionary expansion, persistent redundancy, and a parabolic, self-similar rank-frequency envelope rather than Zipf's law. The analysis uncovers two-regime correlation structures in associated walks (normal or near-diffusive at short times transitioning to superdiffusive at longer scales) and nontrivial cross-correlations between Dyck-word walks. These findings establish a rigorous empirical baseline for understanding the arithmetic-textual structure and motivate future theoretical explanations and learnability studies, including transformer-based analyses on a fully deterministic corpus.

Abstract

We prime-encode the natural numbers via recursive factorisation, iterated to the exponents, generating a corpus of planar rooted trees equivalently represented as Dyck words. This forms a deterministic text endowed with internal rules. Statistical analysis of the corpus reveals that the dictionary and the entropy grow sublinearly, compression shows a non-monotonic trend, and the rank-frequency curves assume a stable parabolic form deviating from Zipf's law. Correlation analysis using mean-squared displacement reveals a transition from normal diffusion to superdiffusion in the associated walk. These findings characterise the tree-encoded sequence as a statistically structured text with long-range correlations grounded in its generative arithmetic law, providing an empirical basis for subsequent theoretical and learnability studies.
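The phrase "recursive factorisation, iterated to the exponents" can be illustrated with a minimal Python sketch. It is not the authors' implementation: the helper names `factorise` and `tower` and the string rendering are assumptions made here for illustration, and the sketch only shows the nested exponent structure, not the paper's mapping from that structure to a planar rooted tree.

```python
def factorise(n):
    """Return {prime: exponent} for an integer n >= 2, using trial division."""
    factors = {}
    p = 2
    while p * p <= n:
        while n % p == 0:
            factors[p] = factors.get(p, 0) + 1
            n //= p
        p += 1
    if n > 1:
        # Whatever remains is a single prime factor.
        factors[n] = factors.get(n, 0) + 1
    return factors


def tower(n):
    """Render n as a product of prime powers, re-factorising every exponent > 1.

    The output format is a hypothetical choice for this sketch.
    """
    if n == 1:
        return "1"
    parts = []
    for p, e in sorted(factorise(n).items()):
        if e == 1:
            parts.append(str(p))
        else:
            # The exponent is itself a natural number, so recurse on it.
            parts.append(f"{p}^({tower(e)})")
    return " * ".join(parts)


print(tower(3099363912))  # 2^(3) * 3^(2 * 3^(2))
```

The printed form matches the decomposition $2^3 \cdot 3^{2 \cdot 3^2}$ quoted in the caption of Figure 1. Trial division is used only for brevity; it would be too slow for building a large corpus.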

Paper Structure

This paper contains 10 sections, 26 equations, 13 figures, 3 tables.

Figures (13)

  • Figure 1: Decorated tree representation of the integer $3099363912 = 2^3 \cdot 3^{18} = 2^3 \cdot 3^{2 \cdot 3^2}$
  • Figure 2: A tree and the corresponding Dyck word 110011011000.
  • Figure 3: The number of different Dyck words $d_n=|{\mathbb{N}\mathcal{D}}_{1}^{n} |$ in the $\mathbb{N}$atural Text from position 1 to $n$ is represented versus $n$ (black dots). The dashed line represents the power-law fit to the data. In the inset the same data are represented in a log-log plot.
  • Figure 4: Statistical Symmetry ${\rm SS}(k,n)$, defined in \ref{eq:defSym}, as a function of the tuple size $k$ for several values of the length of ${\mathbb{N}\mathcal{T}}_{1}^{n}$.
  • Figure 5: Empirical entropy (blue dots) and compression ratio (red squares), as functions of sequence length $n$, together with model fits. Blue crosses represent the theoretical upper bound of entropy, given by $\log_2 |{\mathbb{N}\mathcal{D}}_{1}^{n}|$. The diamond marks the point where the compression rate reaches its minimum.
  • ...and 8 more figures
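
The correspondence between a Dyck word and a rooted tree, as in the caption of Figure 2, can be sketched in a few lines of Python. Two conventions are assumed here and are not confirmed by the text above: `1` opens a node and `0` closes it, and the word lists the subtrees hanging from an implicit root. The function names are hypothetical. The height sequence computed by `height_walk` is one natural way to turn a word into a walk; whether it coincides with the walk analysed in the paper is likewise an assumption.

```python
def height_walk(word):
    """Heights after each symbol: '1' steps up by one, any other symbol steps down."""
    h, path = 0, []
    for c in word:
        h += 1 if c == "1" else -1
        path.append(h)
    return path


def is_dyck(word):
    """True if word uses only '0'/'1', never dips below height 0, and ends at 0."""
    path = height_walk(word)
    return (
        set(word) <= {"0", "1"}
        and all(h >= 0 for h in path)
        and (not path or path[-1] == 0)
    )


def to_tree(word):
    """Parse a Dyck word into nested lists; the outer list holds the
    children of an implicit root (an assumed convention)."""
    if not is_dyck(word):
        raise ValueError(f"not a Dyck word: {word!r}")
    root = []
    stack = [root]
    for c in word:
        if c == "1":
            child = []
            stack[-1].append(child)  # attach a new node to the current one
            stack.append(child)      # and descend into it
        else:
            stack.pop()              # close the current node
    return root


word = "110011011000"
print(height_walk(word))  # [1, 2, 1, 0, 1, 2, 1, 2, 3, 2, 1, 0]
print(to_tree(word))      # [[[]], [[], [[]]]]
```

Under these conventions the word from Figure 2 describes a root with two subtrees: a node with a single leaf child, and a node whose children are a leaf and a node with one leaf.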