Quadratic Term Correction on Heaps' Law

Oscar Fontanelli, Wentian Li

TL;DR

Heaps' law does not exactly fit the type-token relation in log-log scale. The paper introduces a quadratic correction in log space, $\log(V) = c_0 + \alpha \log(T) + \beta (\log(T))^2$, which corresponds to $V = c T^{\alpha + \beta \log(T)}$ with $\log(c) = c_0$, and supports it with an empirical analysis of twenty English texts and a random ball-drawing (urn) model. The fits give $\alpha$ slightly larger than 1 and $\beta \approx -0.02$ on average, with the curvature interpretable as a negative pseudo-variance; the Zipf exponent modulates the magnitude of the curvature. Together, the work provides a more accurate description of vocabulary growth and a mechanistic explanation for the observed log-log curvature, with implications for linguistic data fitting and large-corpus analysis.
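
The two forms in the TL;DR are the same statement, and the correction also implies a local slope that declines with text length. A short derivation follows; the numerical illustration after it assumes base-10 logarithms and the rounded values $\alpha = 1$, $\beta = -0.02$, which are illustrative and not taken from any single book:

$$
\log V = c_0 + \alpha \log T + \beta (\log T)^2 = \log c + \bigl(\alpha + \beta \log T\bigr) \log T
\;\Longrightarrow\;
V = c\, T^{\alpha + \beta \log T}
$$

$$
\frac{d \log V}{d \log T} = \alpha + 2 \beta \log T,
\qquad
\frac{d^2 \log V}{d (\log T)^2} = 2 \beta
$$

Under those assumptions the local slope at $T = 10^5$ would be $1 + 2(-0.02)(5) = 0.8$. A negative $\beta$ is what makes the log-log curve concave.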

Abstract

Heaps' or Herdan's law characterizes the word-type vs. word-token relation by a power-law function, which is concave in linear-linear scale but a straight line in log-log scale. However, it has been observed that even in log-log scale, the type-token curve is still slightly concave, invalidating the power-law relation. At the next order of approximation, we show, using twenty English novels or writings (some translated into English from another language), that quadratic functions in log-log scale fit the type-token data perfectly. Regression analyses of log(type)-log(token) data with both a linear and a quadratic term consistently lead to a linear coefficient slightly larger than 1 and a quadratic coefficient around -0.02. Using the "random drawing of colored balls from a bag with replacement" model, we show that the curvature in log-log scale is identical to a "pseudo-variance", which is negative. Although a pseudo-variance calculation may encounter numerical instability when the number of tokens is large, due to the large values of the pseudo-weights, this formalism provides a rough estimate of the curvature when the number of tokens is small.
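
The regression and the urn model described in the abstract can be sketched together in a few lines. The example below is a minimal illustration under stated assumptions, not the authors' code: it uses the standard closed form for the expected number of distinct colors after $T$ draws with replacement, $\sum_i \bigl(1 - (1 - p_i)^T\bigr)$, with Zipf-distributed probabilities, then fits linear and quadratic polynomials to the log-log points by ordinary least squares. The function names (`zipf_probs`, `expected_types`, `fit_polynomial`, `rss`), the vocabulary size of 20,000, and the Zipf exponent of 1.0 are all hypothetical choices made for illustration. It does not attempt to reproduce the paper's pseudo-variance formula.

```python
import math

def zipf_probs(n_types, exponent=1.0):
    """Normalized Zipf probabilities p_r proportional to r**(-exponent)."""
    weights = [r ** -exponent for r in range(1, n_types + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def expected_types(probs, n_tokens):
    """Expected number of distinct colors after n_tokens draws with replacement."""
    return sum(1.0 - (1.0 - p) ** n_tokens for p in probs)

def fit_polynomial(xs, ys, degree):
    """Least-squares polynomial fit via the normal equations.

    Returns coefficients ordered from the constant term upward.
    """
    n = degree + 1
    a = [[sum(x ** (i + j) for x in xs) for j in range(n)] for i in range(n)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, n):
            factor = a[row][col] / a[col][col]
            for k in range(col, n):
                a[row][k] -= factor * a[col][k]
            b[row] -= factor * b[col]
    # Back substitution.
    coeffs = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(a[i][k] * coeffs[k] for k in range(i + 1, n))
        coeffs[i] = (b[i] - tail) / a[i][i]
    return coeffs

def rss(xs, ys, coeffs):
    """Residual sum of squares of a polynomial fit."""
    return sum(
        (y - sum(c * x ** k for k, c in enumerate(coeffs))) ** 2
        for x, y in zip(xs, ys)
    )

# Assumed toy setup: 20,000 colors with Zipf exponent 1.0.
probs = zipf_probs(20_000, exponent=1.0)
token_counts = [100, 200, 500, 1000, 2000, 5000, 10_000, 20_000, 50_000, 100_000]
xs = [math.log10(t) for t in token_counts]
ys = [math.log10(expected_types(probs, t)) for t in token_counts]

linear = fit_polynomial(xs, ys, 1)           # [intercept, slope]
c0, alpha, beta = fit_polynomial(xs, ys, 2)  # quadratic correction

print(f"linear slope       : {linear[1]:.4f}")
print(f"quadratic alpha    : {alpha:.4f}")
print(f"quadratic beta     : {beta:.4f}")
print(f"RSS linear vs quad : {rss(xs, ys, linear):.3e} vs {rss(xs, ys, [c0, alpha, beta]):.3e}")
```

Because the expected-type curve of this urn model is concave in log-log scale, the fitted `beta` comes out negative. Its magnitude depends on the assumed vocabulary size and Zipf exponent, so it should not be read as a prediction of the value reported for real texts.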

Paper Structure

This paper contains 10 sections, 20 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Word-type ($y$-axis) vs. word-token ($x$-axis) counts in an English translation of War and Peace, in log-log scale. The grey cloud around, e.g., $x=100$ and $y=70$ shows the word-type counts in (non-overlapping) moving windows of 100 tokens along the text (noise is added to $x=100$). Similar grey clouds are drawn for moving windows of 200, 500, 1000, $\cdots$ tokens. The last point, at $x=500,000$, more or less covers the whole book. The orange circles indicate the word-type count in the first window. The blue crosses indicate the median of the word-type counts for a given window size. Linear (solid lines) and quadratic (dashed lines) regressions on all windows (grey points, pink lines) and on medians (blue crosses, blue lines) are shown.
  • Figure 2: Word-type ($y$-axis) vs. token ($x$-axis) counts for 20 books from Project Gutenberg, in log-log scale.
  • Figure 3: Local slopes in $\log_{10}$(type) vs. $\log_{10}$(token) plots (also called local elasticity) for the 20 books, as a function of $\log_{10}$(token count).
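
The quantities plotted in Figures 1 and 3 can be computed along the following lines. This is a minimal sketch, not the authors' procedure: `window_type_counts` counts distinct types in consecutive non-overlapping windows as the Figure 1 caption describes, and `local_slopes` estimates the local slope by finite differences between neighboring log-log points. The captions do not say how the local slope was computed, so the finite-difference estimate is an assumption. Both function names and the toy token list are hypothetical.

```python
from statistics import median

def window_type_counts(tokens, window):
    """Distinct-type counts in consecutive non-overlapping windows of `window` tokens.

    A trailing partial window, if any, is dropped.
    """
    return [
        len(set(tokens[start:start + window]))
        for start in range(0, len(tokens) - window + 1, window)
    ]

def local_slopes(log_tokens, log_types):
    """Finite-difference slopes between consecutive points of a log-log curve."""
    return [
        (y1 - y0) / (x1 - x0)
        for x0, y0, x1, y1 in zip(log_tokens, log_types, log_tokens[1:], log_types[1:])
    ]

# Toy token sequence standing in for a tokenized book.
tokens = "a b a c b a d e a a".split()

counts_by_window = {w: window_type_counts(tokens, w) for w in (2, 5, 10)}
medians = {w: median(c) for w, c in counts_by_window.items()}
print(counts_by_window)
print(medians)
```

For a real text, the medians per window size would supply the points of the log-log curve, and `local_slopes` applied to their base-10 logarithms would give a curve comparable in spirit to Figure 3.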