KDSTM: Neural Semi-supervised Topic Modeling with Knowledge Distillation
Weijie Xu, Xiaoyu Jiang, Jay Desai, Bin Han, Fuqin Yan, Francis Iannacci
TL;DR
KDSTM tackles the problem of text classification in low-resource settings by fusing neural topic modeling with two guiding mechanisms: Optimal Transport-based topic-to-label alignment and knowledge-distillation guidance from a small labeled set. It trains a neural topic model with fixed corpus embeddings, aligns topics to seed labels via an OT objective, and propagates label information to unlabeled documents through a cosine-similarity–driven distillation loss in a three-stage pipeline. Across multiple datasets, KDSTM achieves higher accuracy, micro-F1, and AUC than supervised topic-modeling baselines and is competitive with weakly supervised methods, while avoiding pretrained embeddings and offering fast training suitable for resource-constrained environments. This approach thus provides a practical and scalable solution for semi-supervised topic classification, especially for low-resource languages and devices.
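As a rough illustration of the topic-to-label alignment idea described above, the following is a minimal sketch and not the authors' implementation. It assumes topics and seed labels are each represented by a vector, uses cosine distance as the transport cost, and solves an entropy-regularized OT problem with Sinkhorn iterations. The function names (`cosine_similarity`, `sinkhorn_plan`, `align_topics_to_labels`), the regularization value, the uniform marginals, and the toy vectors are all assumptions made for illustration; the summary does not specify the actual cost, solver, or loss.

```python
import math

def cosine_similarity(u, v):
    # Cosine of the angle between two non-zero vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def sinkhorn_plan(cost, reg=0.1, n_iters=200):
    # Entropy-regularized OT with uniform marginals (an assumption).
    n, m = len(cost), len(cost[0])
    a = [1.0 / n] * n  # mass on each topic
    b = [1.0 / m] * m  # mass on each label
    kernel = [[math.exp(-c / reg) for c in row] for row in cost]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]

def align_topics_to_labels(topic_vecs, label_vecs, reg=0.1):
    # Hypothetical helper: cost is cosine distance between topic and label vectors.
    cost = [[1.0 - cosine_similarity(t, l) for l in label_vecs] for t in topic_vecs]
    plan = sinkhorn_plan(cost, reg)
    # Assign each topic to the label that receives most of its mass.
    return [max(range(len(row)), key=row.__getitem__) for row in plan]

# Toy vectors (made up): topic 0 points toward label 1, topic 1 toward label 0.
topic_vecs = [[0.9, 0.1], [0.1, 0.9]]
label_vecs = [[0.0, 1.0], [1.0, 0.0]]
assignment = align_topics_to_labels(topic_vecs, label_vecs)
print(assignment)  # [1, 0]
```

In a trained model, the transport cost itself would typically enter the loss so that gradients move topics toward their labels; the hard argmax assignment here is only for readability.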
Abstract
In text classification tasks, fine-tuning pretrained language models such as BERT and GPT-3 yields competitive accuracy; however, both models require pretraining on large text datasets. In contrast, general topic modeling methods can analyze documents and extract meaningful patterns of words without the need for pretraining. To leverage topic modeling's unsupervised insight extraction for text classification tasks, we develop Knowledge Distillation Semi-supervised Topic Modeling (KDSTM). KDSTM requires no pretrained embeddings and only a few labeled documents, and it is efficient to train, making it ideal for resource-constrained settings. Across a variety of datasets, our method outperforms existing supervised topic modeling methods in classification accuracy, robustness, and efficiency, and achieves performance similar to state-of-the-art weakly supervised text classification methods.
