A Crucial Parameter for Rank-Frequency Relation in Natural Languages

Chenchen Ding

TL;DR

This work shows that a four-parameter rank-frequency model $f \propto r^{-\alpha} \cdot (r+\gamma)^{-\beta}$ can be effectively analyzed by transforming to a beta-prime distribution through $t = r/(r+\gamma)$, with $\gamma$ acting as the key determinant of vocabulary-growth resistance. By introducing a zeroth word and performing moment-based estimation on the transformed variable $t$, the method converts parameter estimation into learning $\alpha$, $\beta$ for a fixed $\gamma$ and selecting $\gamma$ to stabilize the normalization constant $C$. Empirical results on multilingual data show that optimal $\gamma$ and the resulting $\alpha$, $\beta$ yield predictions with RMSE comparable to post-hoc fitting, supporting the approach and its interpretation in terms of maximum entropy and existence of moments. Case studies across controlled vocabularies, POS categories, and characters reveal when the model is appropriate and how linguistic structure and corpus composition influence the tail/head balance via $\gamma$ and $\alpha+\beta$.
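One way to see the beta-prime connection is to treat the rank as a continuous variable. This is a simplifying assumption made here for illustration; the paper itself works with discrete ranks and a zeroth word. With $x = r/\gamma$, the model matches a beta-prime kernel, and $t = r/(r+\gamma) = x/(1+x)$ maps it to a beta kernel. The shape symbols $a$ and $b$ are introduced only for this sketch.

$$
\begin{aligned}
f &\propto r^{-\alpha}(r+\gamma)^{-\beta} \propto x^{-\alpha}(1+x)^{-\beta} = x^{a-1}(1+x)^{-a-b}, \qquad a = 1-\alpha,\quad b = \alpha+\beta-1, \\
x &= \frac{t}{1-t},\quad dx = \frac{dt}{(1-t)^{2}} \;\Longrightarrow\; x^{a-1}(1+x)^{-a-b}\,dx = t^{a-1}(1-t)^{b-1}\,dt .
\end{aligned}
$$

Under this reading, both shape parameters are positive only when $\alpha < 1$ and $\alpha + \beta > 1$.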

Abstract

The formulation $f \propto r^{-\alpha} \cdot (r+\gamma)^{-\beta}$ has been empirically shown to be more precise than a naïve power law $f \propto r^{-\alpha}$ in modeling the rank-frequency ($r$-$f$) relation of words in natural languages. This work shows that the only crucial parameter in the formulation is $\gamma$, which depicts the resistance to vocabulary growth on a corpus. A method of parameter estimation by searching for an optimal $\gamma$ is proposed, where a "zeroth word" is introduced technically for the calculation. The formulation and parameters are further discussed with several case studies.
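The formulation can be explored numerically with a minimal sketch such as the one below. It is not the paper's estimation procedure. The function names and the parameter values are hypothetical and chosen only for illustration. It evaluates the normalized model over ranks $1..V$, applies the transform $t = r/(r+\gamma)$, and uses the calculus identity $d\log f / d\log r = -(\alpha + \beta t)$ to show how $\gamma$ sets where the log-log slope moves from about $-\alpha$ (head) to about $-(\alpha+\beta)$ (tail).

```python
import numpy as np


def rank_frequency(vocab_size, alpha, beta, gamma):
    """Relative frequencies for ranks 1..vocab_size under
    f(r) = C * r**(-alpha) * (r + gamma)**(-beta), with C chosen so they sum to 1."""
    r = np.arange(1, vocab_size + 1, dtype=float)
    weights = r ** (-alpha) * (r + gamma) ** (-beta)
    return weights / weights.sum()


def transform_rank(r, gamma):
    """Map a rank r > 0 to t = r / (r + gamma), which lies in (0, 1) for gamma > 0."""
    return r / (r + gamma)


def local_slope(r, alpha, beta, gamma):
    """Log-log slope d(log f)/d(log r) = -(alpha + beta * t) at rank r."""
    return -(alpha + beta * transform_rank(r, gamma))


# Hypothetical parameter values, for illustration only (not taken from the paper).
ALPHA, BETA, GAMMA, V = 0.5, 0.7, 50.0, 10_000

ranks = np.arange(1, V + 1, dtype=float)
freq = rank_frequency(V, ALPHA, BETA, GAMMA)
t = transform_rank(ranks, GAMMA)

# Head of the curve is close to slope -alpha; far tail approaches -(alpha + beta).
head_slope = local_slope(1.0, ALPHA, BETA, GAMMA)
tail_slope = local_slope(1e6, ALPHA, BETA, GAMMA)
```

With $\gamma = 0$ the model collapses to the naïve power law with exponent $\alpha+\beta$, which is why a larger $\gamma$ delays the onset of the steeper tail.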

Paper Structure (15 sections, 15 equations, 6 figures, 1 table)

Figures (6)

  • Figure 1: Plots of the King James Version of the Bible (kjv, left) and the Bible in Basic English (bbe, right). See the Appendix for the configuration.
  • Figure 2: Plots of words (left) and POS (right) on the Brown corpus. See the Appendix for the configuration.
  • Figure 3: The $\gamma$-$\log V[C(\gamma)]$ curves on the characters in the Bible. Left: the English King James Version; the upper/lower curves are for lower-cased/original letters. Right: the Chinese Union Version; the upper/lower curves are for simplified/traditional Chinese characters. See the Appendix for the configuration.
  • ...and 3 more figures