A Crucial Parameter for Rank-Frequency Relation in Natural Languages
Chenchen Ding
TL;DR
This work shows that a four-parameter rank-frequency model $f \propto r^{-\alpha} \cdot (r+\gamma)^{-\beta}$ can be effectively analyzed by transforming to a beta-prime distribution through $t = r/(r+\gamma)$, with $\gamma$ acting as the key determinant of vocabulary-growth resistance. By introducing a zeroth word and performing moment-based estimation on the transformed variable $t$, the method converts parameter estimation into learning $\alpha$, $\beta$ for a fixed $\gamma$ and selecting $\gamma$ to stabilize the normalization constant $C$. Empirical results on multilingual data show that optimal $\gamma$ and the resulting $\alpha$, $\beta$ yield predictions with RMSE comparable to post-hoc fitting, supporting the approach and its interpretation in terms of maximum entropy and existence of moments. Case studies across controlled vocabularies, POS categories, and characters reveal when the model is appropriate and how linguistic structure and corpus composition influence the tail/head balance via $\gamma$ and $\alpha+\beta$.
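As a sketch of why the substitution $t = r/(r+\gamma)$ is convenient, treat the rank as a continuous variable $r > 0$ with density $p(r) = C \, r^{-\alpha} (r+\gamma)^{-\beta}$. This continuous approximation is an assumption made here for illustration and may differ from the paper's own treatment with discrete ranks and the zeroth word. Since $r = \gamma t/(1-t)$, $r+\gamma = \gamma/(1-t)$ and $dr = \gamma (1-t)^{-2} \, dt$, the change of variables gives

$$
p(r)\,dr = C \, \gamma^{1-\alpha-\beta} \, t^{-\alpha} \, (1-t)^{\alpha+\beta-2} \, dt, \qquad 0 < t < 1 .
$$

Under this approximation $t$ follows a beta distribution with shape parameters $1-\alpha$ and $\alpha+\beta-1$ (equivalently, $r/\gamma$ follows a beta-prime distribution with the same shape parameters). The density is normalizable only if $\alpha < 1$ and $\alpha+\beta > 1$, in which case

$$
C = \frac{\gamma^{\alpha+\beta-1}}{B(1-\alpha,\ \alpha+\beta-1)} ,
$$

where $B$ is the beta function. For a fixed $\gamma$, moments of $t$ therefore determine the two shape parameters, and hence $\alpha$ and $\beta$.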
Abstract
$f \propto r^{-\alpha} \cdot (r+\gamma)^{-\beta}$ has been empirically shown to be more precise than a naïve power law $f \propto r^{-\alpha}$ in modeling the rank-frequency ($r$-$f$) relation of words in natural languages. This work shows that the only crucial parameter in the formulation is $\gamma$, which characterizes the resistance to vocabulary growth in a corpus. A parameter-estimation method that searches for an optimal $\gamma$ is proposed, in which a "zeroth word" is introduced as a technical device for the calculation. The formulation and parameters are further discussed through several case studies.
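The role of $\gamma$ in the formulation can be illustrated numerically. The following is a minimal sketch, not the paper's estimation procedure: the function names `rank_frequency` and `local_slope`, the vocabulary size, and the parameter values are all hypothetical choices made for illustration. It compares the model with a plain power law of the same total exponent, and uses the identity $d \log f / d \log r = -(\alpha + \beta t)$ with $t = r/(r+\gamma)$, which follows directly from the formula.

```python
import numpy as np


def rank_frequency(n_words, alpha, beta, gamma):
    """Relative frequencies with f(r) proportional to r^-alpha * (r + gamma)^-beta, r = 1..n_words."""
    ranks = np.arange(1, n_words + 1, dtype=float)
    weights = ranks ** (-alpha) * (ranks + gamma) ** (-beta)
    return weights / weights.sum()  # normalize so the frequencies sum to 1


def local_slope(ranks, alpha, beta, gamma):
    """Log-log slope d(log f)/d(log r) = -(alpha + beta * t), with t = r / (r + gamma)."""
    t = ranks / (ranks + gamma)
    return -(alpha + beta * t)


# Hypothetical values chosen only for illustration; they are not taken from the paper.
n_words = 50_000
alpha, beta, gamma = 0.6, 0.9, 50.0

model = rank_frequency(n_words, alpha, beta, gamma)
# A plain power law with the same total exponent (beta = 0, gamma = 0).
power_law = rank_frequency(n_words, alpha + beta, 0.0, 0.0)

# Ratio of the most frequent word to the second most frequent word.
head_ratio_model = model[0] / model[1]
head_ratio_power = power_law[0] / power_law[1]

# Slopes at the head (r = 1), at r = gamma, and at the tail (r = n_words).
slopes = local_slope(np.array([1.0, gamma, float(n_words)]), alpha, beta, gamma)

print("head ratio, model vs. power law:", head_ratio_model, head_ratio_power)
print("log-log slopes at r = 1, gamma, n_words:", slopes)
```

In this sketch the head of the distribution is flatter than the power law, with a slope close to $-\alpha$ for $r \ll \gamma$, while the tail approaches the slope $-(\alpha+\beta)$ for $r \gg \gamma$, so $\gamma$ sets the rank at which the transition happens.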
