Correlation Dimension of Natural Language in a Statistical Manifold

Xin Du; Kumiko Tanaka-Ishii

TL;DR

This work reframes natural language as a dynamical system on a statistical manifold and extends the Grassberger-Procaccia correlation-dimension framework to probability distributions by means of the Fisher-Rao distance. Language states $x_t$ are related to next-word distributions $p_t$ through a linear mapping $\phi$. After applying dimension reduction to $q_t$, the authors compute a correlation dimension $\hat{\nu}$ for the sequence $\{p_t\}$, show that it is a lower bound, $\hat{\nu} \le \nu$, and argue that $\nu = \hat{\nu}$ under certain Markov conditions. Across large-scale multilingual texts and genres, they report a near-universal global fractal dimension $\nu \approx 6.5$, reflecting long memory and self-similarity in language, with genre- and source-specific variations (e.g., music). The results are supported by comparisons with random processes and by a demonstration that the Fisher-Rao distance reveals self-similar structure that the Euclidean distance obscures. A scalable dimension-reduction approach enables analysis of long sequences.
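For orientation, the two ingredients named above can be written in their standard textbook forms. This is a sketch under assumptions: the indicator-based estimator and the closed-form Fisher-Rao distance on categorical distributions are the usual definitions, and the paper's own formulas may differ in normalization or notation.

```latex
% Grassberger-Procaccia correlation integral for a sequence p_1, ..., p_N,
% using a distance d between distributions (standard form, assumed here)
C(\varepsilon) = \lim_{N \to \infty} \frac{2}{N(N-1)}
  \sum_{1 \le t < s \le N} \mathbf{1}\{\, d(p_t, p_s) < \varepsilon \,\},
\qquad
C(\varepsilon) \sim \varepsilon^{\hat{\nu}} \quad (\varepsilon \to 0)

% Fisher-Rao distance between two categorical distributions over a vocabulary V
d_{\mathrm{FR}}(p, q) = 2 \arccos\!\Big( \sum_{w \in V} \sqrt{p(w)\, q(w)} \Big)
```

Under these definitions, $\hat{\nu}$ is read off as the slope of $\log C(\varepsilon)$ against $\log \varepsilon$ over the range where the curve is approximately linear.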

Abstract

The correlation dimension of natural language is measured by applying the Grassberger-Procaccia algorithm to high-dimensional sequences produced by a large-scale language model. This method, previously studied only in a Euclidean space, is reformulated in a statistical manifold via the Fisher-Rao distance. Language exhibits multifractal structure, with global self-similarity and a universal dimension around 6.5, which is smaller than those of simple discrete random sequences and larger than that of a Barabási-Albert process. Long memory is the key to producing self-similarity. Our method is applicable to any probabilistic model of real-world discrete sequences, and we show an application to music data.
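The procedure described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names are hypothetical, the toy data are Dirichlet samples over a four-token vocabulary rather than language-model outputs, and the Fisher-Rao distance uses the standard closed form for categorical distributions, $2\arccos\sum_w \sqrt{p(w)q(w)}$. The dimension-reduction step mentioned in the paper is omitted.

```python
import numpy as np

def fisher_rao_distances(P):
    """Pairwise Fisher-Rao distances between the rows of P.

    Each row of P is assumed to be a categorical distribution.
    """
    # Bhattacharyya coefficient between every pair of rows
    bc = np.sqrt(P) @ np.sqrt(P).T
    # Clip to guard against rounding slightly outside [0, 1]
    return 2.0 * np.arccos(np.clip(bc, 0.0, 1.0))

def correlation_integral(D, eps):
    """Fraction of distinct pairs (t < s) whose distance is below each eps."""
    upper = D[np.triu_indices(D.shape[0], k=1)]
    return np.array([(upper < e).mean() for e in eps])

def correlation_dimension(D, eps):
    """Least-squares slope of log C(eps) against log eps."""
    C = correlation_integral(D, eps)
    mask = C > 0  # log is undefined where no pair falls within eps
    slope, _ = np.polyfit(np.log(eps[mask]), np.log(C[mask]), 1)
    return slope

# Toy data (assumption): 500 distributions over a 4-token vocabulary
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(4), size=500)
D = fisher_rao_distances(P)
eps = np.logspace(-1.0, -0.3, 10)
nu_hat = correlation_dimension(D, eps)
print(f"estimated correlation dimension: {nu_hat:.2f}")
```

In practice the rows of `P` would be next-word distributions from a language model, and the fit would be restricted to the range of `eps` over which the log-log curve is close to linear.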

Paper Structure (25 sections, 5 theorems, 39 equations, 15 figures, 1 table)

Introduction
Method
Results
Properties of the Mapping $\phi:x_t\mapsto p_t$
Formulation
Linearity
Distance Distortion Rate
Dimension Preservation for Markov Processes
When $a_{\geq t}$ and $a_{\geq s}$ Follow the Same Markov Process
When $a_{\geq t}$ and $a_{\geq s}$ Follow Different Markov Processes
GPT-Like Large-Scale Language Models
Dimension Reduction
Local Fractality
Comparison with Dirichlet Distribution
Local Fractals under Small Context Length
...and 10 more sections

Key Result

Lemma 1

(Lower Bound of the Distance Distortion Rate) The distortion rate $r(x_t,x_s)$ is no smaller than 1 for any $x_t$ and $x_s$.

Figures (15)

Figure 1: Our model of language as a stochastic dynamical system. (a) The difference between the system state $x_t$ and the next-word probability distribution $p_t$. (b) $\{p_t\}$ (where $p_t\in \text{Mult}(V)$) as the image of $\{x_t\}$ (where $x_t\in S$) through the marginalization mapping $\phi$ in Formula (\ref{eq:phi}). In this study, we use $\hat{\nu}$ to approximate $\nu$.
Figure 2: Sequence of distributions $p_t$ underlying the words in Don Quixote, as visualized for words "," (comma) and ";" (semicolon). Each point represents one timestep. The green points represent timesteps at which $p_t(\text{","})$ dominates and the Shannon entropy $H(p_t) < 2.0$, whereas the orange points correspond to high-entropy states with $H(p_t) > 3.0$. Self-similar patterns are observed in both the green and orange regions.
Figure 3: Correlation integral curves as defined by Formula (\ref{eq:corrintegral}) and estimated with GPT2-xl with respect to (a) the maximum-probability threshold $\eta$ in Formula (\ref{eq:maxprob}), (b) the sequence length $N$, and (c) the context length $c$ in Formula (\ref{eq:ctxlen}).
Figure 4: Correlation dimensions of (a) all books grouped by language, as estimated using GPT2-xl; (b) English books as estimated using GPT with different model sizes (GPT2 from small to xl and the Yi model for 6b and 34b); (c) English texts from various sources with the $R^2$ scores (horizontal axis) of their linear fits to the correlation integral curves; (d) shuffled English books evaluated with GPT2-xl; and (e) English books evaluated with weight-randomized GPT2-xl.
Figure 5: Predicting the probability distribution $p_t$ over a vocabulary with the GPT2-xl model, which has 48 layers.
...and 10 more figures

Theorems & Definitions (9)

Lemma 1
proof
Theorem 2
proof
Theorem 3
proof
Corollary 4
Theorem 5
proof
