Zipf Distributions from Two-Stage Symbolic Processes: Stability Under Stochastic Lexical Filtering
Vladimir Berman
TL;DR
This paper addresses the origin of Zipf-like word-frequency distributions with a two-stage symbolic framework. In the first stage, the Full Combinatorial Word Model (FCWM) generates words from an alphabet of size $m$ with a geometric length distribution set by the blank-symbol probability $q$. In the second stage, a Stochastic Lexical Filter (SLF) prunes the resulting space to a plausible lexicon. Analytically, the paper derives a Zipf-type tail $p(R) \sim R^{-\alpha}$ under broad lexicon-growth profiles $T_k \sim C m^{\gamma k}$, with exponent $\alpha = \dfrac{1}{\gamma}\left(1 - \dfrac{\ln(1-q)}{\ln m}\right)$, so the power-law tail is stable under filtering. Numerical simulations with $m=26$ and $q=0.18$ reproduce a flat head and a tail exponent $\alpha \approx 1.32$, in line with empirical exponents from English, Russian, and Brown corpora, and supporting a geometric interpretation over optimization-based explanations. The study suggests that Zipf's law arises as a universal geometric signature of symbolic combinatorics, robust to lexical pruning and independent of grammar, meaning, or communicative pressures.
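The exponent formula above can be evaluated directly. The following is a minimal sketch, not the paper's code: the function names `zipf_exponent` and `implied_gamma` are hypothetical, and the growth rate $\gamma$ printed at the end is obtained only by inverting the stated formula for the reported $\alpha \approx 1.32$, not taken from the source.

```python
import math

def zipf_exponent(m: int, q: float, gamma: float = 1.0) -> float:
    """Tail exponent alpha = (1/gamma) * (1 - ln(1-q)/ln(m)).

    m     : alphabet size (assumed > 1)
    q     : blank-symbol probability (assumed 0 < q < 1)
    gamma : lexicon-growth rate in T_k ~ C * m**(gamma * k)
    """
    return (1.0 - math.log(1.0 - q) / math.log(m)) / gamma

def implied_gamma(m: int, q: float, alpha: float) -> float:
    """Invert the same formula: the gamma that would give a target alpha."""
    return (1.0 - math.log(1.0 - q) / math.log(m)) / alpha

if __name__ == "__main__":
    # Unfiltered case (gamma = 1) with the parameters quoted in the text.
    print(round(zipf_exponent(26, 0.18), 4))
    # Growth rate consistent with the reported alpha of about 1.32
    # (an inference from the formula, not a value stated in the source).
    print(round(implied_gamma(26, 0.18, 1.32), 4))
```

Under these assumptions, $\gamma = 1$ gives an exponent slightly above 1, and any $\gamma < 1$ (a lexicon growing more slowly than the full combinatorial space) steepens the tail by the factor $1/\gamma$.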
Abstract
The origin of Zipf's law in language remains unsettled and is debated across fields. This study explains Zipf-like behavior through purely geometric mechanisms, without invoking linguistic structure. The Full Combinatorial Word Model (FCWM) forms words from a finite alphabet, which produces a geometric distribution of word lengths. The interplay of two exponentials, the growth in the number of possible words with length and the decay of word probability with length, yields a power-law rank-frequency curve whose exponent is determined by the alphabet size and the blank-symbol probability. Simulations support these predictions and match English, Russian, and mixed-genre data. The symbolic model suggests that Zipf-type laws arise from geometric constraints rather than from communicative efficiency.
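The word-generation stage described above can be illustrated with a small simulation. This is a sketch under stated assumptions rather than the paper's implementation: letters are drawn uniformly, every word has at least one letter, the blank symbol ends a word with probability `q` after each letter, and no lexical filter is applied. The names `sample_word` and `rank_frequency` are hypothetical.

```python
import random
import string
from collections import Counter

def sample_word(rng: random.Random, alphabet: str, q: float) -> str:
    """Draw one word: at least one letter, then stop with probability q
    after each letter, so P(length = k) = q * (1 - q)**(k - 1)."""
    letters = [rng.choice(alphabet)]
    while rng.random() >= q:  # continue with probability 1 - q
        letters.append(rng.choice(alphabet))
    return "".join(letters)

def rank_frequency(n_words: int, alphabet: str = string.ascii_lowercase,
                   q: float = 0.18, seed: int = 0) -> list[float]:
    """Relative frequencies of the sampled words, sorted by descending rank."""
    rng = random.Random(seed)
    counts = Counter(sample_word(rng, alphabet, q) for _ in range(n_words))
    total = sum(counts.values())
    return [count / total for _, count in counts.most_common()]

if __name__ == "__main__":
    freqs = rank_frequency(200_000)
    # The first 26 ranks are the single-letter words, which are equiprobable
    # in this sketch; longer words populate the decaying tail.
    for rank in (1, 10, 26, 100, 1000):
        if rank <= len(freqs):
            print(rank, freqs[rank - 1])
```

Plotting `freqs` against rank on log-log axes is one way to inspect the head and tail; estimating the tail slope reliably needs far more samples than this sketch draws, because long words are individually rare.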
