CAST: Corpus-Aware Self-similarity Enhanced Topic modelling

Yanan Ma, Chenghao Xiao, Chenhan Yuan, Sabine N van der Veer, Lamiece Hassan, Chenghua Lin, Goran Nenadic

TL;DR

CAST addresses the contextualization gap in candidate topic words by introducing corpus-aware contextualized word embeddings and a self-similarity screening mechanism, followed by UMAP-HDBSCAN clustering and centroid-based topic representations. The method computes $E_{\text{final}} = \frac{1}{P} \sum_{i=1}^{P} C_{wi}$ with $C_{wi} = \frac{\text{mean}(e_{i1}, \ldots, e_{ik})}{\|\text{mean}(e_{i1}, \ldots, e_{ik})\|}$ to embed words contextually, and filters low-signal words via $SS_w = \text{cosine\_similarity}([E_w], [E_w])$ across documents. Empirically, CAST (on 20NewsGroups, BBC News, and Elon Musk tweets) achieves higher topic coherence and competitive diversity compared to LDA, Top2Vec, BERTopic, and TopClus, and demonstrates robustness to noisy data. The approach is modular and can enhance existing topic models, with evaluation supported by automatic and GPT-4-based metrics, highlighting practical benefits for real-world corpora and narrative discovery.
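The two quantities in the TL;DR can be illustrated with a minimal sketch. It rests on assumptions not stated in this summary: $P$ is taken to be the number of documents containing word $w$, $e_{i1}, \ldots, e_{ik}$ are taken to be the token embeddings of $w$ in document $i$, and $SS_w$ is read as the mean pairwise cosine similarity between the per-document vectors $C_{wi}$. The function names and the 2-dimensional toy vectors are hypothetical; a real pipeline would obtain the embeddings from a sentence or token encoder.

```python
import math
from itertools import combinations


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a > 0 and norm_b > 0 else 0.0


def corpus_aware_embedding(per_doc_token_embeddings):
    """Return (E_final, [C_w1, ..., C_wP]) for one word.

    per_doc_token_embeddings[i] holds e_i1..e_ik, the embeddings of the
    word's tokens in document i (an assumed input layout).
    """
    contextual = [l2_normalize(mean_vector(tokens))
                  for tokens in per_doc_token_embeddings]
    return mean_vector(contextual), contextual


def self_similarity(contextual):
    """Mean pairwise cosine similarity of a word's per-document vectors.

    This is one assumed reading of SS_w; a word seen in a single document
    is given a score of 1.0 by convention here.
    """
    pairs = list(combinations(contextual, 2))
    if not pairs:
        return 1.0
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)


# Toy data: a topical word whose contexts agree, and a functional word
# whose contexts point in unrelated directions.
topical_docs = [[[1.0, 0.0], [0.8, 0.2]], [[0.9, 0.1]], [[1.0, 0.1]]]
functional_docs = [[[1.0, 0.0]], [[0.0, 1.0]], [[-1.0, 0.0]]]

e_topical, ctx_topical = corpus_aware_embedding(topical_docs)
_, ctx_functional = corpus_aware_embedding(functional_docs)

print("topical SS_w:   ", self_similarity(ctx_topical))     # high
print("functional SS_w:", self_similarity(ctx_functional))  # low
```

Under this reading, a word that means roughly the same thing wherever it appears keeps a high score, while a functional word whose representation shifts with its context scores low and can be screened out.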

Abstract

Topic modelling is a pivotal unsupervised machine learning technique for extracting valuable insights from large document collections. Existing neural topic modelling methods often encode contextual information of documents while ignoring contextual details of candidate centroid words, which leads to inaccurate selection of topic words due to this contextualization gap. In parallel, functional words are frequently selected over topical words. To address these limitations, we introduce CAST: Corpus-Aware Self-similarity Enhanced Topic modelling, a novel topic modelling method that builds upon candidate centroid word embeddings contextualized on the dataset, and a novel self-similarity-based method to filter out less meaningful tokens. Inspired by findings in contrastive learning that self-similarities of functional token embeddings in different contexts are much lower than those of topical tokens, we find self-similarity to be an effective metric to prevent functional words from acting as candidate topic words. Our approach significantly enhances the coherence and diversity of generated topics, as well as the topic model's ability to handle noisy data. Experiments on news benchmark datasets and one Twitter dataset demonstrate the method's superiority in generating coherent, diverse topics and in handling noisy data, outperforming strong baselines.

Paper Structure

This paper contains 20 sections, 2 equations, 3 figures, 8 tables.

Figures (3)

  • Figure 1: Two modules to identify meaningful candidate topic words. Module 1: word embeddings contextualized on the dataset. Module 2: self-similarity scores to filter out functional words. Purple points represent documents, with semantically similar ones clustered together. Words with higher self-similarity scores (green) are selected over those with lower scores and assigned to their closest cluster centroid (topic vector: yellow point) as topic words (green points), rather than relying on general-domain topic words (red points).
  • Figure 2: Ablation analysis of varying self-similarity thresholds on CAST-MPNET for the 20NewsGroups (nr_topics = 10) and BBC News (nr_topics = 5) datasets. All results were averaged across five independent runs for each threshold value. The analysis was constrained to a maximum threshold of 0.7 because the model could not identify sufficient topic words after this threshold.
  • Figure 3: The LLM-based prompts to evaluate the coherence and diversity of the models.
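The selection step described in the Figure 1 caption (keep words with high self-similarity, then attach each to its closest cluster centroid) can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: the function name `assign_topic_words`, the threshold of 0.4, the `top_n` cut-off, and all toy vectors and scores are hypothetical, and the centroids would in practice come from the UMAP-HDBSCAN clustering of document embeddings.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a > 0 and norm_b > 0 else 0.0


def assign_topic_words(word_vectors, self_sim, centroids, threshold=0.4, top_n=2):
    """Filter words by self-similarity, then rank them under the nearest centroid.

    word_vectors: word -> corpus-aware embedding
    self_sim:     word -> self-similarity score
    centroids:    topic label -> topic vector
    threshold:    hypothetical cut-off; words scoring below it are dropped
    """
    scored = {topic: [] for topic in centroids}
    for word, vec in word_vectors.items():
        if self_sim[word] < threshold:
            continue  # treated as a functional / low-signal word
        best = max(centroids, key=lambda topic: cosine(vec, centroids[topic]))
        scored[best].append((cosine(vec, centroids[best]), word))
    # Keep the top_n words per topic, most similar to the centroid first.
    return {topic: [word for _, word in sorted(items, reverse=True)[:top_n]]
            for topic, items in scored.items()}


# Toy inputs (all values invented for illustration).
centroids = {"space": [1.0, 0.0], "sport": [0.0, 1.0]}
word_vectors = {
    "orbit": [0.9, 0.1],
    "rocket": [0.8, 0.3],
    "goal": [0.1, 0.9],
    "the": [0.7, 0.7],
}
self_sim = {"orbit": 0.8, "rocket": 0.75, "goal": 0.7, "the": 0.2}

topics = assign_topic_words(word_vectors, self_sim, centroids)
print(topics)
```

Raising `threshold` removes more candidate words before assignment, which is the trade-off the Figure 2 ablation examines: past a certain value too few words remain to describe each topic.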