
Improving Neural Topic Modeling with Semantically-Grounded Soft Label Distributions

Raymond Li, Amirhossein Abaskohi, Chuyuan Li, Gabriel Murray, Giuseppe Carenini

TL;DR

This work proposes a novel approach to construct semantically-grounded soft label targets using Language Models (LMs) by projecting the next token probabilities, conditioned on a specialized prompt, onto a pre-defined vocabulary to obtain contextually enriched supervision signals.

Abstract

Traditional neural topic models are typically optimized by reconstructing the document's Bag-of-Words (BoW) representations, overlooking contextual information and struggling with data sparsity. In this work, we propose a novel approach to construct semantically-grounded soft label targets using Language Models (LMs) by projecting the next token probabilities, conditioned on a specialized prompt, onto a pre-defined vocabulary to obtain contextually enriched supervision signals. By training the topic models to reconstruct the soft labels using the LM hidden states, our method produces higher-quality topics that are more closely aligned with the underlying thematic structure of the corpus. Experiments on three datasets show that our method achieves substantial improvements in topic coherence and purity over existing baselines. We also introduce a retrieval-based metric, which shows that our approach significantly outperforms existing methods in identifying semantically similar documents, highlighting its effectiveness for retrieval-oriented applications.
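As an illustration of the projection step described in the abstract, the following minimal sketch restricts a next-token logit vector to a pre-defined vocabulary and renormalizes it into a soft label distribution. It is not the authors' implementation: the function name `soft_label_from_logits`, the toy logits, the assumption that each vocabulary word maps to exactly one token id, and the way the temperature `tau` is applied are all hypothetical choices made for this example.

```python
import numpy as np


def soft_label_from_logits(next_token_logits, vocab_token_ids, tau=1.0):
    """Project next-token logits onto a vocabulary subset and renormalize.

    Assumption for this sketch: every vocabulary word corresponds to a
    single token id, and the temperature divides the logits before softmax.
    """
    logits = np.asarray(next_token_logits, dtype=np.float64)
    # Keep only the logits of tokens that belong to the topic-model vocabulary.
    subset = logits[np.asarray(vocab_token_ids)] / tau
    # Subtract the maximum for numerical stability before exponentiating.
    subset = subset - subset.max()
    probs = np.exp(subset)
    # Renormalize so the soft label sums to one over the vocabulary subset.
    return probs / probs.sum()


# Toy example: a 6-token LM vocabulary, 3 of which are topic-model words.
toy_logits = np.array([2.0, 0.5, -1.0, 3.0, 0.0, 1.0])
toy_vocab_ids = [0, 3, 5]  # hypothetical token ids of the vocabulary words
soft_label = soft_label_from_logits(toy_logits, toy_vocab_ids)
flat_label = soft_label_from_logits(toy_logits, toy_vocab_ids, tau=5.0)
```

With `tau=1.0`, a softmax over the subset logits is equivalent to taking the full next-token softmax, keeping only the vocabulary entries, and renormalizing. A larger `tau` flattens the resulting distribution.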

Paper Structure (46 sections, 12 equations, 5 figures, 8 tables)


Figures (5)

  • Figure 1: Comparison of reconstruction targets for the topic "lunar exploration" in the 20 Newsgroups dataset: the bag-of-words (BoW) target (left) versus the LLM-derived soft label distribution (right, obtained using LLaMA-3.2-1B). Words highlighted in green boxes are semantically relevant but absent from the BoW target, whereas words in orange boxes appear in the documents yet are semantically distant.
  • Figure 2: Overview of our proposed method. We create the training target by prompting the language model to generate a label for the document. In ①, the next-token distribution over the vocabulary subset, given the instruction prompt, is used as the soft target. In ②, the KL divergence is used as the reconstruction loss for the neural topic model. In ③, the last hidden state of the final token is used as the input representation.
  • Figure 3: Results on 20 Newsgroups with ERNIE-4.5-0.3B for various values of the temperature $\tau$. The values are normalized to visualize relative trends.
  • Figure 4: Results on 20 Newsgroups, TweetTopic and Stackoverflow with ERNIE-4.5-0.3B for various values of the temperature $\tau$. The metric values are normalized to visualize relative trends.
  • Figure 5: Effect of vocabulary size on topic model performance across the four evaluation metrics on 20 Newsgroups. Results are shown for vocabulary sizes of 500, 1,000, 2,000, and 4,000 words. Our method with ERNIE-4.5 is shown in red.
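
The training objective outlined in the Figure 2 caption (hidden state as input, soft label as target, KL divergence as reconstruction loss) can be sketched as a single forward pass. This is a rough illustration under stated assumptions, not the paper's architecture: the single linear encoder, the softmax decoder over a topic-word matrix, the dimensions, and the random stand-in data are all hypothetical.

```python
import numpy as np


def softmax(x):
    """Row-wise softmax with max subtraction for numerical stability."""
    shifted = x - x.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def kl_divergence(target, predicted, eps=1e-12):
    """KL(target || predicted) for each row of two probability matrices."""
    target = np.clip(target, eps, 1.0)
    predicted = np.clip(predicted, eps, 1.0)
    return np.sum(target * (np.log(target) - np.log(predicted)), axis=-1)


rng = np.random.default_rng(0)
hidden_dim, num_topics, vocab_size = 8, 4, 6  # toy sizes (assumed)

# Stand-ins for the LM's last hidden state of the final token (2 documents).
hidden_state = rng.normal(size=(2, hidden_dim))
# Hypothetical topic-model parameters: a linear encoder and a topic-word matrix.
encoder_weights = rng.normal(size=(hidden_dim, num_topics))
topic_word_logits = rng.normal(size=(num_topics, vocab_size))

# Encode each document into topic proportions, then decode to a word distribution.
topic_proportions = softmax(hidden_state @ encoder_weights)
reconstruction = softmax(topic_proportions @ topic_word_logits)

# Stand-ins for the LM-derived soft label distributions over the vocabulary.
soft_targets = softmax(rng.normal(size=(2, vocab_size)))

# Reconstruction loss: mean KL divergence between soft targets and the output.
loss = kl_divergence(soft_targets, reconstruction).mean()
```

In an actual training loop the loss would be minimized with respect to the topic-model parameters using an automatic differentiation framework. Only the forward computation is shown here.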