Semiparametric Latent Topic Modeling on Consumer-Generated Corpora
Dominic B. Dayta, Erniel B. Barrios
TL;DR
The paper addresses overfitting and poor reconstruction of sparse topic structures in traditional topic models when analyzing consumer-generated corpora. It introduces SemiparTM, a two-stage semiparametric framework that first uses nonnegative matrix factorization to extract a dictionary X and topic expressions B, then applies semiparametric regression to relate these expressions to auxiliary document information Z for predicting topics in new documents. Across simulations and a real customer feedback dataset, SemiparTM, especially the cross-validated variant, achieves competitive or superior cosine similarity with true topic structures compared to LSA, PLSA, and LDA, with particular strength for small vocabularies and corpora. The approach demonstrates practical benefits for automated voice-of-customer (VOC) analysis, enabling interpretable and predictive topic modeling with limited data and available auxiliary information.
Abstract
Legacy procedures for topic modeling have generally suffered from overfitting and from a weakness in reconstructing sparse topic structures. Motivated by consumer-generated corpora, this paper proposes a semiparametric topic model, a two-step approach that combines nonnegative matrix factorization and semiparametric regression. The model enables the reconstruction of sparse topic structures in the corpus and provides a generative model for predicting topics in new documents entering the corpus. Assuming the presence of auxiliary information related to the topics, this approach exhibits better performance in discovering underlying topic structures when the corpora are small and limited in vocabulary. In an actual consumer feedback corpus, the model also provides interpretable and useful topic definitions comparable with those produced by other methods.
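
To make the two-step idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes a tiny hypothetical document-term count matrix `v` and a single hypothetical auxiliary covariate `z` (say, a star rating). Step one factors `v` with standard multiplicative-update NMF into nonnegative document-topic weights `w` and topic-term loadings `h`; how these map onto the paper's dictionary and topic expressions is an assumption here. Step two uses a plain Nadaraya-Watson kernel smoother as a stand-in for the semiparametric regression, which the text above does not specify in detail. All function names and data are illustrative.

```python
import math
import random

def matmul(a, b):
    """Plain-Python matrix product of two lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def frobenius_error(v, w, h):
    """Frobenius norm of the residual v - w h."""
    wh = matmul(w, h)
    return math.sqrt(sum((v[i][j] - wh[i][j]) ** 2
                         for i in range(len(v)) for j in range(len(v[0]))))

def nmf(v, k, iters=200, seed=0, eps=1e-9):
    """Step 1: factor v (docs x terms) into nonnegative w (docs x k) and
    h (k x terms) with multiplicative updates for the Frobenius loss."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(iters):
        # h <- h * (w^T v) / (w^T w h), elementwise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)]
             for a in range(k)]
        # w <- w * (v h^T) / (w h h^T), elementwise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][a] * num[i][a] / (den[i][a] + eps) for a in range(k)]
             for i in range(n)]
    return w, h

def kernel_predict(z_train, w, z_new, bandwidth=1.0):
    """Step 2 (stand-in): Gaussian-kernel weighted average of the training
    documents' topic weights, evaluated at a new covariate value z_new."""
    kw = [math.exp(-0.5 * ((z - z_new) / bandwidth) ** 2) for z in z_train]
    total = sum(kw)
    return [sum(c * row[a] for c, row in zip(kw, w)) / total
            for a in range(len(w[0]))]

# Hypothetical toy corpus: 6 documents, 4 terms, two rough term blocks.
v = [[3, 2, 0, 0],
     [2, 3, 0, 0],
     [4, 3, 0, 1],
     [0, 0, 3, 2],
     [0, 1, 2, 3],
     [0, 0, 4, 3]]
z = [5, 5, 4, 1, 2, 1]  # hypothetical auxiliary covariate per document

w, h = nmf(v, k=2)
pred_high = kernel_predict(z, w, z_new=5)  # topic weights for a new 5-star document
pred_low = kernel_predict(z, w, z_new=1)   # topic weights for a new 1-star document
print(frobenius_error(v, w, h), pred_high, pred_low)
```

The kernel smoother only ever averages topic weights already seen in training, so its predictions stay nonnegative; a regression with an explicit parametric component would need its own safeguard for that.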
