Exploring the Zipf Distribution Through the Lens of Mixtures
Marta Pérez-Casany, Ariel Duarte-López, Jordi Valero
TL;DR
The paper shows that the Zipf distribution, $P(X=x)=\frac{x^{-\alpha}}{\zeta(\alpha)}$ for $x=1,2,\dots$ and $\alpha>1$, can be written exactly both as a mixture of geometric distributions and as a mixture of zero-truncated Poisson distributions, each with its own specific mixing density. It further proves that the Zipf distribution is not the zero-truncation of a mixed Poisson distribution, and derives as a corollary that the Zipf-Poisson Stopped Sum distribution is a mixed Poisson distribution. An empirical illustration using the 135 chapters of Moby Dick checks the mixing construction with Kolmogorov-Smirnov (KS) tests and frequency-of-frequencies analyses, linking the theoretical results to real data-generation mechanisms and tail behavior. Because the representations are exact rather than asymptotic, they show how Zipf-like data can arise from heterogeneous Poisson-like processes, which is relevant to modeling heavy-tailed phenomena across disciplines.
Abstract
The Zipf distribution is a probability distribution widely used across disciplines such as linguistics, physics, genetics, and sociology because of its ubiquity. In this paper, it is proved that the Zipf distribution is both a mixture of geometric distributions and a mixture of zero-truncated Poisson distributions. It is also shown that it is not the zero-truncation of a mixed Poisson distribution. These results are important because they provide insight into the data generation mechanism that leads to Zipf-distributed data. Additionally, it is proved, as a corollary, that the Zipf-Poisson Stopped Sum distribution is a particular case of a mixed Poisson distribution. The results are illustrated by analyzing the 135 chapters of the novel Moby Dick.
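The geometric-mixture claim can be checked numerically with a short sketch. It relies on the standard identity $x^{-\alpha}=\frac{1}{\Gamma(\alpha)}\int_0^\infty t^{\alpha-1}e^{-xt}\,dt$, which suggests mixing geometric distributions on $\{1,2,\dots\}$ with success probability $p=1-e^{-t}$ over the density $t^{\alpha-1}/\big((e^{t}-1)\,\Gamma(\alpha)\,\zeta(\alpha)\big)$ on $t>0$. This parametrization in $t$ is an assumption made here for illustration and may differ from the form of the mixing density stated in the paper. The function names, the midpoint integration rule, and the truncation point are likewise illustrative choices.

```python
import math

def zeta(alpha, n=100_000):
    """Riemann zeta for alpha > 1: partial sum plus Euler-Maclaurin tail terms."""
    head = sum(k ** -alpha for k in range(1, n))
    tail = n ** (1.0 - alpha) / (alpha - 1.0) + 0.5 * n ** -alpha
    return head + tail

def zipf_pmf(x, alpha):
    """Zipf probability P(X = x) = x^(-alpha) / zeta(alpha), x = 1, 2, ..."""
    return x ** -alpha / zeta(alpha)

def geometric_pmf(x, t):
    """Geometric pmf on {1, 2, ...} with success probability p = 1 - exp(-t)."""
    p = -math.expm1(-t)
    return p * math.exp(-t * (x - 1))

def mixing_density(t, alpha, norm):
    """Assumed mixing density on t > 0; norm = Gamma(alpha) * zeta(alpha)."""
    return t ** (alpha - 1.0) / (math.expm1(t) * norm)

def mixture_pmf(x, alpha, upper=50.0, steps=100_000):
    """Integrate geometric_pmf against mixing_density with the midpoint rule.

    The integral over (0, infinity) is truncated at `upper`, where the
    integrand is negligible for the values of alpha used here.
    """
    norm = math.gamma(alpha) * zeta(alpha)
    h = upper / steps
    total = 0.0
    for i in range(steps):
        t = (i + 0.5) * h
        total += geometric_pmf(x, t) * mixing_density(t, alpha, norm)
    return total * h

if __name__ == "__main__":
    alpha = 2.5  # any alpha > 1; chosen arbitrarily for the illustration
    for x in range(1, 6):
        print(x, zipf_pmf(x, alpha), mixture_pmf(x, alpha))
```

For each $x$, the two printed columns should agree up to the quadrature error. For $\alpha$ close to 1 the mixing density is steep near $t=0$, so a finer grid or an adaptive integrator would be needed.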
