Corrections of Zipf's and Heaps' Laws Derived from Hapax Rate Models

Łukasz Dębowski

TL;DR

This work addresses persistent deviations from Zipf's and Heaps' laws by embedding a parametric hapax rate function within the classic urn model, linking hapax dynamics to vocabulary growth and rank-frequency distributions. By deriving analytic expressions for the vocabulary size $g(n)$, the spectrum $g(n|k)$, and the rank function $g(n||f)$ under four hapax rate models, and testing them on 14 Project Gutenberg texts, the author shows that the logistic hapax rate model yields the best empirical fit and gives explicit corrections to the canonical laws. The framework also explains why the Herdan-Heaps law may overestimate type growth and how mixture models can capture vocabulary growth in very large corpora, pointing to future work on lexicon evolution and lexical class structure. Overall, the approach provides a principled, computable toolkit for correcting Zipf's and Heaps' laws, with direct linguistic relevance.

Abstract

The article introduces corrections to Zipf's and Heaps' laws based on systematic models of the proportion of hapaxes, i.e., words that occur once. The derivation rests on two assumptions. The first is the standard urn model, which predicts that marginal frequency distributions for shorter texts look as if word tokens were sampled blindly from a given longer text. The second posits that the hapax rate is a simple function of the text length. Four such functions are discussed: the constant model, the Davis model, the linear model, and the logistic model. It is shown that the logistic model yields the best fit.
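
The two assumptions can be illustrated with a minimal Python sketch. It is not the paper's implementation. It assumes that "hapax rate" means the number of hapaxes divided by the number of distinct word types (the abstract only says "proportion of hapaxes"), and it models blind sampling as drawing tokens without replacement. The function names `hapax_rate` and `urn_sample` and the toy sentence are hypothetical.

```python
import random
from collections import Counter


def hapax_rate(tokens):
    """Share of distinct word types that occur exactly once.

    Assumption: the rate is taken relative to the number of types;
    the source only describes it as the "proportion of hapaxes".
    """
    counts = Counter(tokens)
    if not counts:
        return 0.0
    hapaxes = sum(1 for c in counts.values() if c == 1)
    return hapaxes / len(counts)


def urn_sample(tokens, n, seed=0):
    """Draw n tokens blindly (without replacement) from a longer text.

    This mimics the urn model: a shorter text is treated as a random
    subsample of the tokens of a longer one.
    """
    rng = random.Random(seed)
    return rng.sample(tokens, n)


if __name__ == "__main__":
    # Toy text, purely illustrative; real experiments would use full books.
    text = "the cat sat on the mat and the dog sat on the log".split()
    for n in (4, 8, len(text)):
        sample = urn_sample(text, n)
        print(n, round(hapax_rate(sample), 3))
```

Printing the rate for several sample sizes `n` gives an empirical hapax rate curve as a function of text length, which is the kind of quantity the four parametric models are meant to describe.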

Paper Structure (20 sections, 87 equations, 29 figures, 3 tables)

Figures (29)

  • Figure 1: The $U$-shaped hapax rate function for a mixture of the Davis model with $\alpha=10$ and the maximal model. The weight of the Davis model is $(1-\lambda)$ and the weight of the maximal model is $\lambda$.
  • Figure 2: W. Shakespeare, First Folio/35 Plays.
  • Figure 3: W. Shakespeare, First Folio/35 Plays.
  • Figure 4: W. Cather, One of Ours.
  • Figure 5: W. Cather, One of Ours.
  • ...and 24 more figures