S2vNTM: Semi-supervised vMF Neural Topic Modeling
Weijie Xu, Jay Desai, Srinivasan Sengamedu, Xiaoyu Jiang, Francis Iannacci
TL;DR
This paper tackles semi-supervised topic modeling with limited labeled guidance by introducing S2vNTM, a neural topic model that uses a von Mises–Fisher latent space to cluster topics and incorporate seed keywords. The framework combines an encoder producing $\text{vMF}(\boldsymbol{\mu}, \kappa)$ representations, a temperature-adjusted topic distribution, and a decoder that reconstructs inputs while respecting seed guidance through a composite loss $L = L_{\text{Recon}} + L_{\text{KL}} + \beta L_{\text{CE}} + \gamma L_{\text{NS}}$. Key innovations include a keyword–topic matching mechanism that mitigates redundancy, and a negative sampling strategy that pushes unrelated words away from seed topics, all built on dataset-specific spherical embeddings. Empirically, S2vNTM outperforms existing semi-supervised topic models on multiple datasets in classification metrics, and it enables rapid, interactive topic refinement without pretraining. These properties make it practical for real-world scenarios with limited labeled data and the need for explainable, controllable topic discovery.
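To make the composite loss concrete, the following is a minimal sketch and not the authors' implementation. It shows how the four terms could be combined with the weights $\beta$ and $\gamma$, and how a temperature can sharpen a topic distribution. The function names, the softmax form of the temperature adjustment, and all numeric values are assumptions chosen for illustration; they are not taken from the paper.

```python
import math


def temperature_softmax(scores, temperature=1.0):
    """Map raw topic scores to a probability distribution.

    Assumption: the temperature adjustment is a scaled softmax.
    Lower temperatures concentrate mass on the highest-scoring topic.
    """
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def composite_loss(l_recon, l_kl, l_ce, l_ns, beta=1.0, gamma=1.0):
    """Combine the terms as L = L_Recon + L_KL + beta * L_CE + gamma * L_NS."""
    return l_recon + l_kl + beta * l_ce + gamma * l_ns


# Arbitrary placeholder values, used only to exercise the functions.
topic_probs = temperature_softmax([2.0, 1.0, 0.1], temperature=0.5)
total = composite_loss(l_recon=3.2, l_kl=0.4, l_ce=0.9, l_ns=0.25,
                       beta=2.0, gamma=4.0)
```

In this sketch, raising `beta` gives more weight to the seed-keyword cross-entropy term and raising `gamma` gives more weight to the negative sampling term, relative to reconstruction and KL.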
Abstract
Language-model-based methods are powerful techniques for text classification. However, these models have several shortcomings: (1) it is difficult to integrate human knowledge such as keywords; (2) they require substantial resources to train; and (3) they rely on large text corpora for pretraining. In this paper, we propose Semi-Supervised vMF Neural Topic Modeling (S2vNTM) to overcome these difficulties. S2vNTM takes a few seed keywords per topic as input. It leverages the patterns of these keywords to identify potential topics and to optimize the quality of each topic's keyword set. Across a variety of datasets, S2vNTM outperforms existing semi-supervised topic modeling methods in classification accuracy when only a limited number of keywords is provided. S2vNTM is also at least twice as fast as the baselines.
