Common TF-IDF variants arise as key components in the test statistic of a penalized likelihood-ratio test for word burstiness

Zeyad Ahmed, Paul Sheridan, Michael McIsaac, Aitazaz A. Farooque

Abstract

TF-IDF is a classical formula that is widely used for identifying important terms within documents. We show that TF-IDF-like scores arise naturally from the test statistic of a penalized likelihood-ratio test setup capturing word burstiness (also known as word over-dispersion). In our framework, the alternative hypothesis captures word burstiness by modeling a collection of documents according to a family of beta-binomial distributions with a gamma penalty term on the precision parameter. In contrast, the null hypothesis assumes that words are binomially distributed in collection documents, a modeling approach that fails to account for word burstiness. We find that a term-weighting scheme derived from this test statistic performs comparably to TF-IDF on document classification tasks. This paper provides insights into TF-IDF from a statistical perspective and underscores the potential of hypothesis testing frameworks for advancing term-weighting scheme development.

Paper Structure

This paper contains 28 sections, 59 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Beta-binomial samples of target term $t_i$ counts in documents with fixed $\alpha_i/\alpha_{0i} = 0.3$ and increasing precision $\alpha_{0i}~\in~\{0.1, 1.0, 10, 100\}$. Each point represents the count of $t_i$ in a single document of length $n_j=100$.
  • Figure 2: Comparison of the approximate beta-binomial and penalized beta-binomial language models on a simulated document collection of size $d = 40$, with all documents having equal length $n_j = 50$ $(\forall j)$, and true parameter values $\alpha_i = 0.10$ and $\alpha_{\neg i} = 49.90$. The log-likelihood contour (upper left) and surface (upper right) plots for the beta-binomial approximation show a pronounced ridge structure. In contrast, the corresponding plots for the penalized beta-binomial approximation (bottom left and bottom right) show that the gamma penalty term breaks the ridge by regularizing the objective function. Gamma parameter values of $\mu=50$ and $\sigma^2=4$ were used to generate the plots. The choice $\sigma^2 = 4$ was made to make the breaking of the ridge visually apparent. However, alternative values of $\sigma^2$ produce the same qualitative effect, differing only in the steepness of the surface gradient.
  • Figure 3: Scatterplot of the relationship between the likelihood-ratio test score $\lambda_i$ and the total TF-IDF weight of term $t_i$ across all collection documents. Point diameter is proportional to the magnitude of the Dirichlet concentration parameter $\alpha_i$.
  • Figure 4: Empirical distributions of the fitted beta-binomial parameters $\alpha_i$ and $\alpha_{\neg i}$ for the 20 Newsgroups dataset vocabulary, shown on a natural logarithmic scale. The pronounced concentrations of mass at $\alpha_i \ll 1$ and $\alpha_{\neg i} \gg 1$ support the assumptions of Claim 1.
  • Figure 5: Empirical distributions of the fitted beta-binomial parameters $\alpha_i$ and $\alpha_{\neg i}$ for the R8 dataset vocabulary, shown on a natural logarithmic scale. The pronounced concentrations of mass at $\alpha_i \ll 1$ and $\alpha_{\neg i} \gg 1$ support the assumptions of Claim 1.
  • ...and 2 more figures
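
The sampling setup described in the Figure 1 caption can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' code: the function names, the number of simulated documents, and the seed are hypothetical, and it assumes the precision is the sum of the two beta parameters, $\alpha_{0i} = \alpha_i + \alpha_{\neg i}$. It draws counts of a term in documents of length $n_j = 100$ with $\alpha_i/\alpha_{0i} = 0.3$ and compares the sample variance against the standard beta-binomial variance formula, which approaches the binomial variance as the precision grows.

```python
import random
import statistics


def sample_beta_binomial(n, alpha, beta, rng):
    """Draw one count: p ~ Beta(alpha, beta), then count ~ Binomial(n, p)."""
    p = rng.betavariate(alpha, beta)
    # Binomial draw as a sum of n Bernoulli(p) trials (standard library only).
    return sum(rng.random() < p for _ in range(n))


def simulate(precision, mean_ratio=0.3, doc_length=100, num_docs=2000, seed=0):
    """Simulate term counts in num_docs documents (hypothetical helper).

    Assumes precision = alpha + beta, so alpha = mean_ratio * precision.
    """
    rng = random.Random(seed)
    alpha = mean_ratio * precision
    beta = (1.0 - mean_ratio) * precision
    return [sample_beta_binomial(doc_length, alpha, beta, rng)
            for _ in range(num_docs)]


def theoretical_variance(precision, mean_ratio=0.3, doc_length=100):
    """Beta-binomial variance: n * m * (1 - m) * (a0 + n) / (a0 + 1)."""
    binomial_var = doc_length * mean_ratio * (1.0 - mean_ratio)
    return binomial_var * (precision + doc_length) / (precision + 1.0)


if __name__ == "__main__":
    # Precision values taken from the Figure 1 caption.
    for precision in (0.1, 1.0, 10.0, 100.0):
        counts = simulate(precision)
        print(f"precision={precision:>6}: "
              f"mean={statistics.mean(counts):.2f}, "
              f"sample var={statistics.pvariance(counts):.1f}, "
              f"theoretical var={theoretical_variance(precision):.1f}")
```

Under these assumptions, low precision yields counts that pile up near 0 or near the document length (bursty behavior), while high precision yields counts clustered around the mean, which is the regime where a binomial model is a reasonable description.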

Theorems & Definitions (6)

  • Claim 1
  • proof : Derivation
  • Claim 2
  • proof : Derivation
  • Claim 3
  • proof : Derivation