Quadratic Term Correction on Heaps' Law
Oscar Fontanelli, Wentian Li
TL;DR
Heaps' law does not exactly fit the type-token relation in log-log scale. The paper introduces a quadratic correction in $\log$-space, $\log(V) = c_0 + \alpha \log(T) + \beta (\log(T))^2$, which corresponds to $V = c T^{\alpha + \beta \log(T)}$, and validates it through empirical analysis of twenty English texts and a random-ball drawing (urn) model. The results show $\alpha \approx 1$ and $\beta \approx -0.02$ on average, with the curvature interpretable as a negative pseudo-variance; the Zipf exponent modulates the magnitude of the curvature. Together, the work provides a more accurate description of vocabulary growth and a mechanistic explanation for the observed log-log curvature, with implications for linguistic data fitting and large-corpus analysis.
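As a quick check of the equivalence between the two forms stated above, here is a worked step. It writes $c = e^{c_0}$ and assumes natural logarithms; with another base $b$ the constant would be $c = b^{c_0}$, and the numerical value of $\beta$ would change accordingly.

$$V = e^{c_0}\, e^{(\alpha + \beta \log T)\log T} = c\, T^{\alpha + \beta \log T}, \qquad \frac{d \log V}{d \log T} = \alpha + 2\beta \log T.$$

Setting $\beta = 0$ recovers the pure power law of Heaps' law. For $\beta < 0$ the local log-log slope decreases as $T$ grows, which is the concavity a single power law cannot capture.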
Abstract
Heaps' or Herdan's law characterizes the word-type vs. word-token relation by a power-law function, which is concave in linear-linear scale but a straight line in log-log scale. However, it has been observed that even in log-log scale, the type-token curve is still slightly concave, invalidating the power-law relation. At the next-order approximation, we show, using twenty English novels or writings (some translated into English from another language), that quadratic functions in log-log scale fit the type-token data perfectly. Regression analyses of log(type)-log(token) data with both a linear and a quadratic term consistently lead to a linear coefficient slightly larger than 1 and a quadratic coefficient around -0.02. Using the ``random drawing of colored balls from a bag with replacement'' model, we show that the curvature in log-log scale is identical to a ``pseudo-variance'', which is negative. Although a pseudo-variance calculation may encounter numerical instability when the number of tokens is large, due to the large values of the pseudo-weights, this formalism provides a rough estimate of the curvature when the number of tokens is small.
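The regression and the ball-drawing model described above can be illustrated with a minimal sketch. This is not the authors' code. It assumes natural logarithms, ball colors with Zipf-like probabilities over a finite set of 50,000 types, and an ordinary least-squares fit through `numpy.polyfit`. The function names (`zipf_probabilities`, `type_token_curve`, `fit_log_quadratic`) and all parameter values are hypothetical choices made for illustration.

```python
import numpy as np


def zipf_probabilities(n_types, exponent=1.0):
    """Probabilities p_r proportional to r**(-exponent) for ranks r = 1..n_types."""
    ranks = np.arange(1, n_types + 1)
    weights = ranks ** (-float(exponent))
    return weights / weights.sum()


def type_token_curve(tokens):
    """Return V, where V[i] is the number of distinct types among tokens[0..i]."""
    tokens = np.asarray(tokens)
    # return_index gives the position of the first occurrence of each type
    _, first_idx = np.unique(tokens, return_index=True)
    is_new = np.zeros(len(tokens), dtype=int)
    is_new[first_idx] = 1
    return np.cumsum(is_new)


def fit_log_quadratic(T, V):
    """Least-squares fit of log V = c0 + alpha*log T + beta*(log T)**2 (natural log)."""
    x = np.log(np.asarray(T, dtype=float))
    y = np.log(np.asarray(V, dtype=float))
    # polyfit returns the highest-degree coefficient first
    beta, alpha, c0 = np.polyfit(x, y, deg=2)
    return c0, alpha, beta


# Urn model (assumed setup): draw colored balls with replacement,
# with color probabilities following a Zipf-like distribution.
rng = np.random.default_rng(0)
p = zipf_probabilities(n_types=50_000, exponent=1.0)
tokens = rng.choice(len(p), size=100_000, p=p)

V = type_token_curve(tokens)
T = np.arange(1, len(tokens) + 1)

# Sample the curve at log-spaced token counts so that small T is not swamped by large T.
sample = np.unique(np.logspace(0, np.log10(len(tokens)), 60).astype(int))
sample = np.clip(sample, 1, len(tokens)) - 1  # convert token counts to array indices

c0, alpha, beta = fit_log_quadratic(T[sample], V[sample])
print(f"c0={c0:.3f}  alpha={alpha:.3f}  beta={beta:.4f}")
```

To apply the same fit to a real text, replace `tokens` with the sequence of word tokens mapped to integer IDs. The fitted values depend on the logarithm base, the sampling of points, and the text, so they need not match the averages reported in the paper.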
