Exploring the Zipf Distribution Through the Lens of Mixtures

Marta Pérez-Casany, Ariel Duarte-López, Jordi Valero

TL;DR

The paper shows that the Zipf distribution, $P(X=x)=\frac{x^{-\alpha}}{\zeta(\alpha)}$ with $\alpha>1$, arises both as a mixture of geometric distributions and as a mixture of zero-truncated Poisson distributions, each with its own specific mixing density. It further proves that the Zipf distribution is not the zero-truncation of a mixed Poisson distribution, and derives as a corollary that the Zipf-Poisson Stopped Sum distribution is a mixed Poisson distribution. An empirical illustration on the 135 chapters of Moby Dick examines the results via KS tests and frequency-of-frequencies analyses, linking the theory to real data-generation mechanisms and tail behavior. The representations are exact rather than asymptotic, and they show how Zipf-like data can arise from heterogeneous Poisson-like processes, which is relevant to modeling heavy-tailed phenomena across disciplines.

Abstract

The Zipf distribution is a probability distribution widely used by scientists from various disciplines due to its ubiquity. These disciplines include linguistics, physics, genetics, and sociology, among others. In this paper, it is proved that the Zipf distribution is both a mixture of geometric distributions and a mixture of zero-truncated Poisson distributions. It is also shown that it is not the zero-truncation of a mixed Poisson distribution. These results are important because they provide insight into the data generation mechanism that leads to data from a Zipf distribution. Additionally, it is proved, as a corollary, that the Zipf-Poisson Stopped Sum distribution is a particular case of a mixed Poisson distribution. The results are illustrated by analyzing the 135 chapters of the novel Moby Dick.

Paper Structure

This paper contains 9 sections, 3 theorems, 28 equations, 4 figures, 2 tables.

Key Result

Theorem 1

The Zipf($\alpha$) distribution is a mixture of geometric distributions with domain $\{1,2,3,\ldots\}$ and parameter $s = -\log(1-p)$, with mixing distribution:
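
The mixing density itself is not reproduced in this excerpt. As a rough numerical check of the statement, the sketch below assumes the density $f(s;\alpha) = \frac{s^{\alpha-1}}{(e^{s}-1)\,\Gamma(\alpha)\,\zeta(\alpha)}$ for $s>0$, which follows from the identity $\Gamma(\alpha)\,x^{-\alpha} = \int_0^\infty s^{\alpha-1} e^{-sx}\,ds$; the paper's own statement may be written or parameterized differently. All function names are illustrative, and the integral is approximated with a plain midpoint rule rather than any method taken from the paper.

```python
import math


def zeta(alpha, terms=1000):
    """Riemann zeta for alpha > 1: partial sum plus an Euler-Maclaurin tail estimate."""
    n = terms
    partial = sum(k ** -alpha for k in range(1, n))
    tail = (n ** (1 - alpha) / (alpha - 1)
            + 0.5 * n ** -alpha
            + alpha * n ** (-alpha - 1) / 12)
    return partial + tail


def zipf_pmf(x, alpha):
    """Target probability P(X = x) = x^(-alpha) / zeta(alpha)."""
    return x ** -alpha / zeta(alpha)


def geometric_pmf(x, s):
    """Geometric pmf on {1, 2, ...} with success probability p = 1 - exp(-s)."""
    return -math.expm1(-s) * math.exp(-s * (x - 1))


def mixing_density(s, alpha, norm):
    """ASSUMED mixing density (not quoted from the paper); norm = Gamma(alpha) * zeta(alpha)."""
    return s ** (alpha - 1) / (math.expm1(s) * norm)


def mixture_pmf(x, alpha, s_max=50.0, n=50_000):
    """Midpoint-rule approximation of the mixture integral over s in (0, s_max)."""
    norm = math.gamma(alpha) * zeta(alpha)
    h = s_max / n
    total = 0.0
    for i in range(n):
        s = (i + 0.5) * h  # midpoints avoid evaluating the density at s = 0
        total += geometric_pmf(x, s) * mixing_density(s, alpha, norm)
    return total * h


if __name__ == "__main__":
    alpha = 2.0
    for x in (1, 2, 5):
        print(x, mixture_pmf(x, alpha), zipf_pmf(x, alpha))
```

Under this assumption the two printed columns should agree to several decimal places. The upper limit `s_max` and the grid size `n` are arbitrary choices that trade accuracy for speed.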

Figures (4)

  • Figure 1: PMFs of the Zipf distribution for $\alpha = 1.5, 2, 3.5$ and $5$. On the left-hand side: normal scale. On the right-hand side: log-log scale.
  • Figure 2: The mixing distribution of Theorem 1, as a function of $s$ (on the left-hand side) and as a function of $p$ (on the right-hand side), for different $\alpha$ values.
  • Figure 3: The mixing distribution of Theorem 2 a), as a function of $\lambda$ for $\alpha = 1.1, 1.5, 2, 3.5$ and $5$.
  • Figure 4: On the left-hand side, the frequencies of frequencies are shown in log-log scale, along with the fit obtained using a Zipf($\hat{\alpha}$) distribution. On the right-hand side, the empirical cumulative probabilities are displayed, along with the theoretical cumulative probability distribution function for chapters 1 (top) and 135 (bottom).

Theorems & Definitions (6)

  • Theorem 1
  • proof
  • Theorem 2
  • proof
  • Corollary 1
  • proof