S2vNTM: Semi-supervised vMF Neural Topic Modeling

Weijie Xu; Jay Desai; Srinivasan Sengamedu; Xiaoyu Jiang; Francis Iannacci

TL;DR

This paper tackles semi-supervised topic modeling with limited labeled guidance by introducing S2vNTM, a neural topic model that uses a von Mises–Fisher (vMF) latent space to cluster topics and incorporate seed keywords. The framework combines an encoder producing $\text{vMF}(\boldsymbol{\mu}, \kappa)$ representations, a temperature-adjusted topic distribution, and a decoder that reconstructs inputs while respecting seed guidance through a composite loss $L = L_{\text{Recon}} + L_{\text{KL}} + \beta L_{\text{CE}} + \gamma L_{\text{NS}}$. Key innovations include a keyword–topic matching mechanism that mitigates redundancy and a negative sampling strategy that pushes unrelated words away from seed topics, both built on dataset-specific spherical embeddings. Empirically, S2vNTM outperforms existing semi-supervised topic models on multiple datasets in classification metrics, and it enables rapid, interactive topic refinement without pretraining. These properties make it practical for real-world scenarios with limited labeled data and a need for explainable, controllable topic discovery.
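As a rough illustration of two pieces named in the summary above, the following minimal sketch shows a temperature-scaled softmax over topic scores and the weighted sum that forms the composite loss. The function names, the default weights, and the convention that the temperature multiplies the scores before the softmax are assumptions made for illustration; the paper's own temperature function and loss terms may be defined differently.

```python
import math


def temperature_softmax(scores, temperature=1.0):
    """Softmax over topic scores after scaling by a temperature factor.

    Assumption: the temperature multiplies the scores, so larger values
    give a sharper (more peaked) topic distribution.
    """
    scaled = [temperature * s for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def composite_loss(l_recon, l_kl, l_ce, l_ns, beta=1.0, gamma=1.0):
    """Weighted sum L = L_Recon + L_KL + beta * L_CE + gamma * L_NS.

    The four inputs are assumed to be scalar loss values that have
    already been computed elsewhere; beta and gamma are hypothetical
    default weights.
    """
    return l_recon + l_kl + beta * l_ce + gamma * l_ns


if __name__ == "__main__":
    topic_scores = [0.1, 0.5, -0.2]  # made-up scores for three topics
    print(temperature_softmax(topic_scores, temperature=1.0))
    print(temperature_softmax(topic_scores, temperature=10.0))
    print(composite_loss(1.0, 0.5, 2.0, 4.0, beta=0.5, gamma=0.25))
```

In this sketch, raising `temperature` concentrates probability mass on the highest-scoring topic, while `beta` and `gamma` control how strongly the seed-keyword and negative-sampling terms weigh against the reconstruction and KL terms.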

Abstract

Language-model-based methods are powerful techniques for text classification. However, these models have several shortcomings: (1) it is difficult to integrate human knowledge such as keywords; (2) they need a lot of resources to train; (3) they rely on large text corpora for pretraining. In this paper, we propose Semi-Supervised vMF Neural Topic Modeling (S2vNTM) to overcome these difficulties. S2vNTM takes a few seed keywords per topic as input. It leverages the pattern of keywords to identify potential topics and to optimize the quality of each topic's keyword set. Across a variety of datasets, S2vNTM outperforms existing semi-supervised topic modeling methods in classification accuracy when only a limited number of keywords is provided. S2vNTM is also at least twice as fast as the baselines.

Paper Structure (25 sections, 15 equations, 9 figures, 3 tables)

Introduction
Method
vNTM
Loss Function
Topic and Keywords set Matching
Negative Sampling
Results
Conclusion and Future Work
Modularity of S2vNTM
Temperature function
Related Work and Challenges
Weakly-supervised text classification
Topic Modeling
Semi-supervised Topic Modeling
Negative Sampling
...and 10 more sections

Figures (9)

Figure 1: An S2vNTM application scenario. Human experts first define the keyword set for each topic and the number of topics. During training, S2vNTM outputs keywords for each topic by merging redundant keyword groups and identifying new topics. Human experts then confirm or remove keywords and/or add new ones. S2vNTM continues refining the keyword list with a fast fine-tuning procedure. After a few iterations, S2vNTM provides users with topics that have high-quality keywords and high topic classification accuracy.
Figure 2: The neural network architecture of S2vNTM. Data dimensions are shown in brackets: $n$ is the number of documents, $v$ is the vocabulary size, $t$ is the number of topics, and $e$ is the embedding dimension. The word embedding (green) is fixed during training. Pink represents user-provided data. Orange denotes the loss functions $L_{KL}$, $L_{Recon}$, $L_{CE}$, and $L_{NS}$.
Figure 3: Impact of increasing the vMF temperature on various metrics for the S2vNTM model on AG News.
Figure 4: Effect of gamma. (Right y-axis shows the mean.)
Figure 5: Results for accuracy, topic diversity, macro F1, and AUC-ROC for GuidedLDA, CoreEx, and S2vNTM. (Right y-axis shows the mean.)
...and 4 more figures
