Mixtures of Unsupervised Lexicon Classification

Peratham Wiriyathammabhum

TL;DR

The paper tackles unsupervised lexicon classification for text by extending BayesLex with mixture and nonparametric modeling to handle group clustering. It first reframes lexicon weighting as a finite mixture of multinomial Naive Bayes models and then generalizes to Dirichlet Process Mixtures and Hierarchical DP Mixtures, enabling an unbounded set of shared components across groups. A key theoretical insight is that aggregating lexicon scores across multiple lexicons is equivalent to a mixture NB model, which provides a principled justification for score fusion and facilitates nonparametric extensions. The proposed mixture framework offers a flexible, scalable approach for weighting lexicons and classifying documents, with practical implications for filtering and topic-aware lexicon analysis in real-world corpora.
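The finite mixture of multinomial Naive Bayes models mentioned above can be illustrated with a small sketch. This is not the paper's BayesLex estimator: the function names, the toy three-word vocabulary, the mixing weights, and the word probabilities are all assumptions chosen for illustration. It only shows how a document's word counts yield a posterior over mixture components, under the assumed model $p(x) \propto \sum_k \pi_k \prod_w \theta_{kw}^{x_w}$.

```python
import math


def log_multinomial_score(counts, theta):
    """Multinomial log-likelihood up to the count-dependent constant:
    sum_w x_w * log(theta_w). The multinomial coefficient is omitted
    because it is the same for every component and cancels in the posterior."""
    return sum(c * math.log(t) for c, t in zip(counts, theta))


def mixture_posterior(counts, weights, thetas):
    """Return (posterior over components, log evidence up to a constant)
    for one document under a finite mixture of multinomial Naive Bayes models."""
    log_joint = [
        math.log(w) + log_multinomial_score(counts, theta)
        for w, theta in zip(weights, thetas)
    ]
    # Log-sum-exp for numerical stability.
    m = max(log_joint)
    log_evidence = m + math.log(sum(math.exp(v - m) for v in log_joint))
    posterior = [math.exp(v - log_evidence) for v in log_joint]
    return posterior, log_evidence


# Hypothetical toy setup: vocabulary = ["good", "okay", "bad"],
# two components with equal mixing weights.
weights = [0.5, 0.5]
thetas = [
    [0.7, 0.2, 0.1],  # component 0 favors "good"
    [0.1, 0.2, 0.7],  # component 1 favors "bad"
]
counts = [3, 1, 0]  # a document with three "good" and one "okay"

posterior, log_evidence = mixture_posterior(counts, weights, thetas)
print(posterior)
```

In this toy case the document is assigned almost entirely to component 0, since its counts are concentrated on the word that component favors. A Dirichlet process mixture would replace the fixed number of components with an unbounded one, which this sketch does not attempt.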

Abstract

This paper presents a mixture version of method-of-moments unsupervised lexicon classification, obtained by incorporating a Dirichlet process.

Paper Structure (14 sections, 20 equations, 3 figures)

Figures (3)

  • Figure 1: A representation of a Naïve Bayes model.
  • Figure 2: A representation of a mixture model. The right plate diagram has a latent variable $z$. In the left plate diagram, $\pi$ is a variable instead of a constant.
  • Figure 3: A representation of a Dirichlet process mixture model.