Non-Zipfian Distribution of Stopwords and Subset Selection Models

Wentian Li, Oscar Fontanelli

TL;DR

A stopword (subset) selection model in which the probability that a word is selected, as a function of its rank $r$, is a decreasing Hill function, $1/(1+(r/r_{mid})^\gamma)$, whereas the probability that it is not selected is the standard Hill function, $1/(1+(r_{mid}/r)^\gamma)$.

Abstract

Stopwords are words that carry little information about the content or meaning of a text. Most stopwords are function words, but they can also be common verbs, adjectives, and adverbs. In contrast to the well-known Zipf's law for the rank-frequency plot of all words, the rank-frequency plot of stopwords is best fitted by the Beta Rank Function (BRF). The rank-frequency plots of non-stopwords also deviate from Zipf's law, but they are fitted better by a quadratic function of log-token-count over log-rank than by the BRF. Based on the observed ranks of stopwords in the full word list, we propose a stopword (subset) selection model in which the probability that a word is selected, as a function of its rank $r$, is a decreasing Hill function, $1/(1+(r/r_{mid})^\gamma)$, whereas the probability that it is not selected is the standard Hill function, $1/(1+(r_{mid}/r)^\gamma)$. We validate this selection probability model by direct estimation from an independent collection of texts. We also show analytically that this model leads to a BRF rank-frequency distribution for stopwords when the original full word list follows Zipf's law, and that it explains the quadratic fitting function for the non-stopwords.
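The two selection probabilities stated above can be written down directly. The following is a minimal sketch, not the authors' code: the function names are hypothetical, and the parameter values $\gamma=1.78$ and $r_{mid}=74.9$ are borrowed from the fit reported in the Figure 2 caption purely for illustration.

```python
def p_selected(r, r_mid, gamma):
    """Decreasing Hill function: probability that a rank-r word is a stopword."""
    return 1.0 / (1.0 + (r / r_mid) ** gamma)


def p_not_selected(r, r_mid, gamma):
    """Standard (increasing) Hill function: probability that a rank-r word is not a stopword."""
    return 1.0 / (1.0 + (r_mid / r) ** gamma)


# Illustrative parameters (assumption: taken from the Figure 2 fit).
GAMMA = 1.78
R_MID = 74.9

if __name__ == "__main__":
    for r in (1, 10, 75, 500, 5000):
        ps = p_selected(r, R_MID, GAMMA)
        pn = p_not_selected(r, R_MID, GAMMA)
        # The two probabilities are complementary, so each row sums to 1.
        print(f"r={r:5d}  selected={ps:.4f}  not selected={pn:.4f}  sum={ps + pn:.4f}")
```

The two functions are complementary by construction, since $1/(1+x) + 1/(1+1/x) = 1$ for $x = (r/r_{mid})^\gamma$, and both equal $1/2$ at $r = r_{mid}$, which is why $r_{mid}$ acts as the midpoint rank of the transition.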

Paper Structure (17 sections, 13 equations, 6 figures, 1 table)

This paper contains 17 sections, 13 equations, 6 figures, 1 table.

Figures (6)

  • Figure 1: Rank-frequency plots of all words (black) from Moby Dick (top row) and the Brown corpus (bottom row), and of stopwords (red) from the NLTK list (left column, $n=123$ after removing the contracted words and/or their components) and the spaCy list (right column, $n=305$). The fitting lines for all words are power-law (Zipf) functions, and those for stopwords (pink lines) are BRFs. There are four stopword plots, one for each combination of stopword list and source text.
  • Figure 2: We use 30 books to estimate the probability that a rank-$r$ word is a stopword. Each dot shows the statistics of the rank-$r$ words: $y$ is the proportion of the 30 rank-$r$ words that are stopwords, and $x$ is the rank $r$. Circles are the geometric mean of dots with the same $y$ value. A nonlinear fit of Eq.(\ref{eq-subset-model}) gives the parameter estimates $\gamma=1.78$ and $r_{mid}=74.9$.
  • Figure 3: Subset sampling model. A rank-$r$ word may be selected (subset selection) and is then given a new rank $r_{new}$ within the subset. (A) $r_{new}/r$ as a function of $r$ in the 4 combinations of 2 text sources (Moby Dick and Brown corpus) and 2 stopword lists. The four cumulative sums of the decreasing Hill function (Eq.(\ref{eq-subset-model})) that best fit the four sets of data are shown. (B) Rank-frequency plot of a simulated dataset from the subset selection model. The red line is a BRF fitting function.
  • Figure 4: Rank-frequency plots for non-stopwords. Top (bottom) row is for Moby Dick (Brown corpus), and left (right) column is for NLTK stopwords excluding contracted words (spaCy stopwords). The fit by the power-law (Zipf) function is in black, and the fit by the quadratic function, $\log(T) \sim - \alpha \log(r) - \kappa (\log(r))^2$ (Eq.(\ref{eq-quadratic})), is in red.
  • Figure 5: A word that does not belong to the stopword list (a non-stopword) is ranked $r$ (${r'}_{new}$) before (after) stopwords are removed. (A) ${r'}_{new}/r$ ratio for non-stopwords in the four text source/stopword list combinations. The fitting lines are the cumulative sums from Eq.(\ref{eq-cumsum2}) with the best-fitting parameters. (B) Rank-frequency plot of an artificial dataset produced by picking a word whenever its ${r'}_{new}$ according to Eq.(\ref{eq-cumsum2}) reaches a new integer. The three fitting lines are BRF (grey), Mandelbrot function (pink), and quadratic function (red).
  • ...and 1 more figures
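
The Figure 3 caption relates the new rank $r_{new}$ of a selected word to the cumulative sum of the decreasing Hill function. The sketch below illustrates that relation under stated assumptions and is not the authors' simulation: each rank is selected independently with the Hill probability, the function names are hypothetical, and the default parameters reuse the Figure 2 estimates. The cumulative sum computed here is the expected number of selected words with rank at most $r$, which is the quantity a simulated $r_{new}$ should track.

```python
import random
from itertools import accumulate


def hill_select_prob(r, r_mid, gamma):
    """Decreasing Hill function: probability that a rank-r word is selected."""
    return 1.0 / (1.0 + (r / r_mid) ** gamma)


def expected_selected_count(n_ranks, r_mid, gamma):
    """Cumulative sum of selection probabilities for r = 1..n_ranks.

    Entry r-1 is the expected number of selected words among ranks 1..r.
    """
    probs = (hill_select_prob(r, r_mid, gamma) for r in range(1, n_ranks + 1))
    return list(accumulate(probs))


def simulate_subset(n_ranks, r_mid, gamma, seed=0):
    """Select each rank independently (an assumption of this sketch).

    Returns (r, r_new) pairs, where r_new is the rank within the subset.
    """
    rng = random.Random(seed)
    selected = []
    for r in range(1, n_ranks + 1):
        if rng.random() < hill_select_prob(r, r_mid, gamma):
            selected.append((r, len(selected) + 1))
    return selected


if __name__ == "__main__":
    n, r_mid, gamma = 5000, 74.9, 1.78  # illustrative values only
    expected = expected_selected_count(n, r_mid, gamma)
    for r, r_new in simulate_subset(n, r_mid, gamma)[::20]:
        # Compare the simulated ratio r_new/r with the cumulative-sum prediction.
        print(f"r={r:5d}  r_new={r_new:4d}  r_new/r={r_new / r:.3f}  "
              f"expected/r={expected[r - 1] / r:.3f}")
```

To produce a rank-frequency plot like panel (B), one would additionally attach a token count to each original rank, for example a Zipf-like count proportional to $1/r$, and plot the counts of the selected words against $r_{new}$.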