A Crucial Parameter for Rank-Frequency Relation in Natural Languages

Chenchen Ding

TL;DR

This work shows that a four-parameter rank-frequency model $f \propto r^{-\alpha} \cdot (r+\gamma)^{-\beta}$ can be effectively analyzed by transforming to a beta-prime distribution through $t = r/(r+\gamma)$, with $\gamma$ acting as the key determinant of vocabulary-growth resistance. By introducing a zeroth word and performing moment-based estimation on the transformed variable $t$, the method converts parameter estimation into learning $\alpha$, $\beta$ for a fixed $\gamma$ and selecting $\gamma$ to stabilize the normalization constant $C$. Empirical results on multilingual data show that optimal $\gamma$ and the resulting $\alpha$, $\beta$ yield predictions with RMSE comparable to post-hoc fitting, supporting the approach and its interpretation in terms of maximum entropy and existence of moments. Case studies across controlled vocabularies, POS categories, and characters reveal when the model is appropriate and how linguistic structure and corpus composition influence the tail/head balance via $\gamma$ and $\alpha+\beta$.
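To make the "learn $\alpha$, $\beta$ for a fixed $\gamma$" step concrete, here is a minimal sketch. It does not reproduce the moment-based estimation on $t$ summarized above. It substitutes ordinary least squares in log space, which works because $\log f = \log C - \alpha \log r - \beta \log(r+\gamma)$ is linear in $(\log C, \alpha, \beta)$ once $\gamma$ is fixed. The function names `fit_alpha_beta` and `to_unit_interval` and the synthetic parameter values are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def to_unit_interval(ranks, gamma):
    """Map ranks r > 0 to t = r / (r + gamma), which lies in (0, 1)."""
    r = np.asarray(ranks, dtype=float)
    return r / (r + gamma)

def fit_alpha_beta(ranks, freqs, gamma):
    """Least-squares fit of log f = log C - alpha*log r - beta*log(r + gamma).

    gamma is held fixed; this is a stand-in for the paper's estimator,
    not a reimplementation of it.
    """
    r = np.asarray(ranks, dtype=float)
    f = np.asarray(freqs, dtype=float)
    # Design matrix columns: intercept, -log r, -log(r + gamma).
    design = np.column_stack([np.ones_like(r), -np.log(r), -np.log(r + gamma)])
    coef, *_ = np.linalg.lstsq(design, np.log(f), rcond=None)
    log_c, alpha, beta = coef
    return float(np.exp(log_c)), float(alpha), float(beta)

# Synthetic, noise-free data generated from the model itself (assumed values).
ranks = np.arange(1, 2001, dtype=float)
true_c, true_alpha, true_beta, true_gamma = 5.0, 0.8, 1.1, 30.0
freqs = true_c * ranks ** (-true_alpha) * (ranks + true_gamma) ** (-true_beta)

c_hat, alpha_hat, beta_hat = fit_alpha_beta(ranks, freqs, true_gamma)
t = to_unit_interval(ranks, true_gamma)
print(c_hat, alpha_hat, beta_hat, t.min(), t.max())
```

On real counts one would repeat the fit over a grid of candidate $\gamma$ values and compare the results. The selection criterion described above (stabilizing the normalization constant $C$) is not implemented in this sketch.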

Abstract

The formulation $f \propto r^{-\alpha} \cdot (r+\gamma)^{-\beta}$ has been empirically shown to be more precise than a naïve power law $f \propto r^{-\alpha}$ in modeling the rank-frequency ($r$-$f$) relation of words in natural languages. This work shows that the only crucial parameter in the formulation is $\gamma$, which describes the resistance to vocabulary growth on a corpus. A parameter-estimation method that searches for an optimal $\gamma$ is proposed, in which a "zeroth word" is introduced as a technical device for the calculation. The formulation and its parameters are further discussed in several case studies.
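One way to see how the formulation differs from a naïve power law is to look at its slope on a log-log plot. Differentiating $\log f = -\alpha \log r - \beta \log(r+\gamma) + \text{const}$ gives $\mathrm{d}\log f / \mathrm{d}\log r = -(\alpha + \beta \cdot r/(r+\gamma))$, so the slope moves from about $-\alpha$ for $r \ll \gamma$ to about $-(\alpha+\beta)$ for $r \gg \gamma$, with $\gamma$ setting the crossover rank. The sketch below checks this numerically. The function names and parameter values are illustrative assumptions.

```python
import numpy as np

def model_freq(r, alpha, beta, gamma):
    """Unnormalized model frequency r^(-alpha) * (r + gamma)^(-beta)."""
    r = np.asarray(r, dtype=float)
    return r ** (-alpha) * (r + gamma) ** (-beta)

def local_slope(r, alpha, beta, gamma):
    """Analytic log-log slope: d log f / d log r = -(alpha + beta * r / (r + gamma))."""
    r = np.asarray(r, dtype=float)
    return -(alpha + beta * r / (r + gamma))

def numeric_slope(r, alpha, beta, gamma, h=1e-6):
    """Finite-difference estimate of the same slope, used as a sanity check."""
    f0 = model_freq(r, alpha, beta, gamma)
    f1 = model_freq(r * (1.0 + h), alpha, beta, gamma)
    return (np.log(f1) - np.log(f0)) / np.log(1.0 + h)

alpha, beta, gamma = 0.5, 1.0, 100.0  # assumed values for illustration
for r in (1.0, 100.0, 1e9):
    print(r, float(local_slope(r, alpha, beta, gamma)),
          float(numeric_slope(r, alpha, beta, gamma)))
```

At $r = \gamma$ the slope is exactly $-(\alpha + \beta/2)$, halfway between the head and tail regimes. A pure power law corresponds to a constant slope at every rank.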

Paper Structure (15 sections, 15 equations, 6 figures, 1 table)

Introduction
Derivation
Estimation
Data
Zeroth Word
Calculation
Experiment
Discussion
Principle of Maximum Entropy
Existence of Moments
Case Studies
Controlled Vocabulary
Part-of-Speech
Characters
Conclusion

Figures (6)

Figure 1: Plots for the King James Version of the Bible (kjv, left) and the Bible in Basic English (bbe, right). See the Appendix for the configuration.
Figure 2: Plots for words (left) and POS (right) on the Brown corpus. See the Appendix for the configuration.
Figure 3: The $\gamma$-$\log V[C(\gamma)]$ curves for characters in the Bible. Left: the English King James Version, where the upper/lower curves are for lower-cased/original letters. Right: the Chinese Union Version, where the upper/lower curves are for simplified/traditional Chinese characters. See the Appendix for the configuration.