Mixtures of Unsupervised Lexicon Classification
Peratham Wiriyathammabhum
TL;DR
The paper tackles unsupervised lexicon classification for text by extending BayesLex with mixture and nonparametric modeling to handle group clustering. It first reframes lexicon weighting as a finite mixture of multinomial Naive Bayes models and then generalizes to Dirichlet Process Mixtures and Hierarchical DP Mixtures, enabling an unbounded set of shared components across groups. A key theoretical insight is that aggregating lexicon scores across multiple lexicons is equivalent to a mixture NB model, which provides a principled justification for score fusion and facilitates nonparametric extensions. The proposed mixture framework offers a flexible, scalable approach for weighting lexicons and classifying documents, with practical implications for filtering and topic-aware lexicon analysis in real-world corpora.
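To make the finite-mixture reframing concrete, here is a minimal sketch of class prediction under a finite mixture of multinomial Naive Bayes models. It is a generic illustration and not the paper's estimator: the function names, the array layout, and the assumption that all parameters are already known are hypothetical, and the method-of-moments fitting step is not reproduced.

```python
import numpy as np


def log_sum_exp(a, axis):
    """Numerically stable log(sum(exp(a))) along one axis."""
    m = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(a - m), axis=axis))


def mixture_nb_posterior(counts, class_prior, mix_weights, word_probs):
    """Posterior over classes for one document (hypothetical helper).

    counts      : (V,)      word counts of the document
    class_prior : (C,)      prior probability of each class
    mix_weights : (K,)      weight of each mixture component
    word_probs  : (K, C, V) per-component, per-class word distributions
    """
    # log p(x | class c, component k), dropping the multinomial coefficient,
    # which is constant across classes and components.
    log_lik = np.tensordot(np.log(word_probs), counts, axes=([2], [0]))  # (K, C)
    # Marginalize over components: log sum_k pi_k p(x | c, k).
    log_joint = np.log(mix_weights)[:, None] + log_lik
    log_class = np.log(class_prior) + log_sum_exp(log_joint, axis=0)  # (C,)
    # Normalize to obtain p(class | x).
    return np.exp(log_class - log_sum_exp(log_class, axis=0))


# Toy setup (assumed values): vocabulary = [good, great, bad, awful],
# classes = [positive, negative], two mixture components.
word_probs = np.array([
    [[0.40, 0.40, 0.10, 0.10], [0.10, 0.10, 0.40, 0.40]],
    [[0.35, 0.35, 0.15, 0.15], [0.15, 0.15, 0.35, 0.35]],
])
class_prior = np.array([0.5, 0.5])
mix_weights = np.array([0.6, 0.4])

posterior = mixture_nb_posterior(
    np.array([2, 1, 0, 0]), class_prior, mix_weights, word_probs
)
print(posterior)
```

In this sketch each component plays the role of one weighted word-distribution model, and the mixture weights determine how much each contributes to the final class decision.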
Abstract
This paper presents a mixture version of method-of-moments unsupervised lexicon classification, obtained by incorporating a Dirichlet process.
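As background on how a Dirichlet process supplies mixture weights without fixing the number of components in advance, the following sketch draws weights from the standard stick-breaking construction, truncated at a finite level. The function name, the truncation level, and the concentration value are assumptions chosen for illustration; the paper's own inference procedure is not shown.

```python
import numpy as np


def stick_breaking_weights(alpha, truncation, rng):
    """Draw truncated stick-breaking weights (hypothetical helper).

    alpha      : concentration parameter of the Dirichlet process
    truncation : number of components kept in the truncated draw
    rng        : numpy random Generator
    Returns the weights and the leftover stick mass beyond the truncation.
    """
    # Each beta is the fraction broken off the remaining stick.
    betas = rng.beta(1.0, alpha, size=truncation)
    # Stick length remaining before each break: prod_{j<k} (1 - beta_j).
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - betas[:-1])))
    weights = betas * remaining
    leftover = np.prod(1.0 - betas)
    return weights, leftover


rng = np.random.default_rng(0)
weights, leftover = stick_breaking_weights(alpha=2.0, truncation=20, rng=rng)
print(weights.round(3), leftover)
```

Smaller values of `alpha` concentrate mass on the first few components, while larger values spread it across more of them; the leftover mass indicates how much the truncation ignores.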
