Correlation Dimension of Natural Language in a Statistical Manifold
Xin Du, Kumiko Tanaka-Ishii
TL;DR
This work reframes natural language as a dynamical system on a statistical manifold and extends the Grassberger-Procaccia correlation-dimension framework to probability distributions by using the Fisher-Rao distance. The authors model language states $x_t$ and next-word distributions $p_t$ with a linear mapping $\phi$, apply dimension reduction to $q_t$, and compute a correlation dimension $\hat{\nu}$ for the sequence $\{p_t\}$. They establish the lower bound $\hat{\nu} \le \nu$ and argue that $\nu = \hat{\nu}$ under certain Markov conditions. Across large-scale multilingual texts and genres, they report a near-universal global fractal dimension $\nu \approx 6.5$, reflecting long memory and self-similarity in language, with genre- and source-specific variations (e.g., music). The results are reinforced by comparisons with random processes and by a demonstration that the Fisher-Rao distance reveals self-similar structure that the Euclidean distance obscures. A scalable dimension-reduction approach enables extensive analyses of long sequences.
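For reference, the two standard quantities behind this summary can be written out explicitly. This is a sketch in notation assumed here: the symbols $d_{\mathrm{FR}}$, $C(\varepsilon)$, $N$, and the vocabulary index $w$ are introduced for illustration, and the constant factor in the distance depends on convention. The Fisher-Rao distance between two categorical distributions is

$$
d_{\mathrm{FR}}(p_s, p_t) = 2 \arccos\Big( \sum_{w} \sqrt{p_s(w)\, p_t(w)} \Big),
$$

and the Grassberger-Procaccia correlation integral counts the fraction of pairs of time steps that lie closer than $\varepsilon$:

$$
C(\varepsilon) = \lim_{N \to \infty} \frac{1}{N^2}\, \#\big\{ (s, t) : d_{\mathrm{FR}}(p_s, p_t) < \varepsilon \big\}, \qquad C(\varepsilon) \sim \varepsilon^{\hat{\nu}} \quad (\varepsilon \to 0).
$$

The correlation dimension is then read off as the slope of $\log C(\varepsilon)$ against $\log \varepsilon$ over the range where the plot is linear.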
Abstract
The correlation dimension of natural language is measured by applying the Grassberger-Procaccia algorithm to high-dimensional sequences produced by a large-scale language model. This method, previously studied only in a Euclidean space, is reformulated in a statistical manifold via the Fisher-Rao distance. Language exhibits multifractal behavior, with global self-similarity and a universal dimension around 6.5, which is smaller than those of simple discrete random sequences and larger than that of a Barabási-Albert process. Long memory is the key to producing self-similarity. Our method is applicable to any probabilistic model of real-world discrete sequences, and we show an application to music data.
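The procedure described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names are hypothetical, the input is a toy family of two-outcome distributions rather than language-model output, and no dimension reduction is applied. The sketch computes pairwise Fisher-Rao distances as $2 \arccos \sum_w \sqrt{p(w)\, q(w)}$, forms the correlation integral, and fits the log-log slope.

```python
import numpy as np

def fisher_rao_distances(P):
    """Pairwise Fisher-Rao distances between the rows of P.

    Each row of P is assumed to be a categorical distribution
    (non-negative entries summing to 1).
    """
    root = np.sqrt(P)
    # Bhattacharyya coefficient between every pair of rows.
    bc = root @ root.T
    # Clip to guard against rounding slightly outside [0, 1].
    return 2.0 * np.arccos(np.clip(bc, 0.0, 1.0))

def correlation_integral(D, eps):
    """Fraction of distinct pairs (s < t) with distance below each eps."""
    upper = D[np.triu_indices(D.shape[0], k=1)]
    return np.array([(upper < e).mean() for e in eps])

def correlation_dimension(D, eps):
    """Slope of log C(eps) versus log eps, fitted by least squares."""
    C = correlation_integral(D, eps)
    keep = C > 0  # log is undefined where no pair falls below eps
    slope, _ = np.polyfit(np.log(eps[keep]), np.log(C[keep]), 1)
    return slope

# Toy check (an assumption for illustration, not data from the paper):
# distributions (cos^2 theta, sin^2 theta) trace a one-dimensional curve,
# and their Fisher-Rao distance is 2 * |theta_s - theta_t|.
theta = np.linspace(0.0, np.pi / 2, 1000)
P = np.stack([np.cos(theta) ** 2, np.sin(theta) ** 2], axis=1)
D = fisher_rao_distances(P)
eps = np.geomspace(0.05, 0.5, 10)
nu = correlation_dimension(D, eps)
print(f"estimated correlation dimension: {nu:.2f}")  # expected to be close to 1
```

The full distance matrix needs memory quadratic in the sequence length, so this sketch only suits short sequences. The fitted value also depends on the chosen range of `eps`, which in practice should be restricted to the region where the log-log plot is linear.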
