Statistical Properties of the Rooted-Tree Encoding of $\mathbb{N}$
Pierluigi Contucci, Claudio Giberti, Godwin Osabutey, Cecilia Vernia
TL;DR
This work studies a deterministic text produced by iterated prime factorisation of the natural numbers, encoded as Dyck words representing planar rooted trees. It characterises the text empirically across dictionary growth, orientation, entropy, and compression, revealing sublinear dictionary expansion, persistent redundancy, and a parabolic, self-similar rank-frequency envelope rather than Zipf's law. The associated walks show a two-regime correlation structure (normal or near-normal diffusion at short times, crossing over to superdiffusion at longer scales), and the Dyck-word walks exhibit nontrivial cross-correlations. These findings establish an empirical baseline for understanding the arithmetic-textual structure and motivate future theoretical explanations and learnability studies, including transformer-based analyses on a fully deterministic corpus.
Abstract
We prime-encode the natural numbers via recursive factorisation, iterated to the exponents, generating a corpus of planar rooted trees equivalently represented as Dyck words. This forms a deterministic text endowed with internal rules. Statistical analysis of the corpus reveals that the dictionary and the entropy grow sublinearly, that compression shows a non-monotonic trend, and that the rank-frequency curves assume a stable parabolic form deviating from Zipf's law. Correlation analysis using mean-squared displacement reveals a transition from normal diffusion to superdiffusion in the associated walk. These findings characterise the tree-encoded sequence as a statistically structured text with long-range correlations grounded in its generative arithmetic law, providing an empirical basis for subsequent theoretical and learnability
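
To make the idea of factorisation "iterated to the exponents" concrete, the following Python sketch encodes a natural number as a Dyck word and decodes it back. The convention is an assumption made for illustration and may differ from the one used in the paper: 0 is a leaf `()`, 1 is treated as 2^0, and for n >= 2 the i-th child of the root is the tree of the exponent of the i-th prime, up to the largest prime dividing n. The function names `next_prime`, `prime_exponents`, `encode`, and `decode` are hypothetical.

```python
import math


def next_prime(p):
    """Smallest prime strictly greater than p (trial division, for small inputs)."""
    q = p + 1
    while any(q % d == 0 for d in range(2, math.isqrt(q) + 1)):
        q += 1
    return q


def prime_exponents(n):
    """Exponents [a_1, ..., a_k] of n >= 2 over the primes 2, 3, 5, ...,
    where the k-th prime is the largest prime dividing n (so a_k >= 1)."""
    exps = []
    p = 2
    while n > 1:
        a = 0
        while n % p == 0:
            n //= p
            a += 1
        exps.append(a)
        p = next_prime(p)
    return exps


def encode(n):
    """Dyck word of n under the ASSUMED convention described in the lead-in."""
    if n == 0:
        return "()"      # assumed: zero is a single leaf
    if n == 1:
        return "(())"    # assumed: 1 is written as 2^0
    # One child per prime position; each child encodes that prime's exponent.
    return "(" + "".join(encode(a) for a in prime_exponents(n)) + ")"


def decode(word):
    """Inverse of encode: rebuild the integer from a Dyck word."""

    def parse(i):
        # word[i] is "("; return (value, index just past the matching ")")
        i += 1
        exps = []
        while word[i] == "(":
            value, i = parse(i)
            exps.append(value)
        if not exps:
            return 0, i + 1  # a leaf decodes to 0
        result, p = 1, 2
        for a in exps:
            result *= p ** a
            p = next_prime(p)
        return result, i + 1

    value, _ = parse(0)
    return value


if __name__ == "__main__":
    for n in range(1, 13):
        print(n, encode(n))
```

Under this assumed convention, a corpus like the one described would be the concatenation of `encode(n)` for n = 1, 2, ..., N. The order of children matters because the child position identifies the prime, which is why the trees are planar.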
