Corrections of Zipf's and Heaps' Laws Derived from Hapax Rate Models
Łukasz Dębowski
TL;DR
This work addresses the persistent deviations from Zipf's and Heaps' laws by embedding a parametric hapax rate function within the classic urn model, which links hapax dynamics to vocabulary growth and rank-frequency distributions. The author derives analytic expressions for the vocabulary size $g(n)$, the spectrum $g(n|k)$, and the rank function $g(n||f)$ under four hapax rate models and tests them on 14 Project Gutenberg texts; the logistic hapax rate model gives the best empirical fit and yields precise corrections to the canonical laws. The framework also explains why the Herdan-Heaps law may overestimate type growth and how mixture models can capture vocabulary growth in very large corpora, pointing to future work on lexicon evolution and lexical class structure. Overall, the approach provides a computable toolkit for principled corrections to Zipf's and Heaps' laws with direct linguistic relevance.
Abstract
The article introduces corrections to Zipf's and Heaps' laws based on systematic models of the proportion of hapaxes, i.e., words that occur once. The derivation rests on two assumptions. The first is the standard urn model, which predicts that marginal frequency distributions for shorter texts look as if word tokens were sampled blindly from a given longer text. The second is that the hapax rate is a simple function of the text length. Four such functions are discussed: the constant model, the Davis model, the linear model, and the logistic model. It is shown that the logistic model yields the best fit.
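The urn assumption can be made concrete with a small sketch. If $n$ tokens are drawn without replacement from a text of $N$ tokens, a word type with frequency $f$ in the full text appears exactly $k$ times in the sample with hypergeometric probability $\binom{f}{k}\binom{N-f}{n-k}/\binom{N}{n}$, and summing over types gives the expected spectrum. The Python code below is only an illustration of that standard calculation, not the article's own implementation: the function names are hypothetical, and the hapax rate is taken here to be the expected number of hapaxes divided by the expected number of types, which is an assumption since the abstract speaks only of "the proportion of hapaxes".

```python
from collections import Counter
from math import comb


def expected_spectrum(tokens, n, k):
    """Expected number of word types seen exactly k times when n tokens
    are drawn without replacement from `tokens` (hypergeometric urn model)."""
    N = len(tokens)
    if not 0 <= n <= N:
        raise ValueError("sample size n must satisfy 0 <= n <= len(tokens)")
    if k < 0 or k > n:
        return 0.0
    freqs = Counter(tokens).values()
    # math.comb returns 0 when the lower index exceeds the upper one,
    # so impossible configurations contribute nothing to the sum.
    numerator = sum(comb(f, k) * comb(N - f, n - k) for f in freqs)
    return numerator / comb(N, n)


def expected_vocabulary(tokens, n):
    """Expected number of distinct word types in a sample of n tokens."""
    N = len(tokens)
    if not 0 <= n <= N:
        raise ValueError("sample size n must satisfy 0 <= n <= len(tokens)")
    total = comb(N, n)
    freqs = Counter(tokens).values()
    # A type is present unless all n draws miss its f occurrences.
    return sum(1 - comb(N - f, n) / total for f in freqs)


def expected_hapax_rate(tokens, n):
    """Assumed definition: expected hapaxes divided by expected types (n >= 1)."""
    return expected_spectrum(tokens, n, 1) / expected_vocabulary(tokens, n)


if __name__ == "__main__":
    toy_text = "a b a c a b d".split()  # made-up toy data, not from the article
    for n in range(1, len(toy_text) + 1):
        print(n, expected_vocabulary(toy_text, n), expected_hapax_rate(toy_text, n))
```

Evaluating `expected_hapax_rate` over a range of sample sizes gives an urn-model hapax rate curve for a given text, which is the kind of curve that a parametric function of the text length could then be compared against.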
