GloCOM: A Short Text Neural Topic Model via Global Clustering Context

Quang Duc Nguyen, Tung Nguyen, Duc Anh Nguyen, Linh Ngo Van, Sang Dinh, Thien Huu Nguyen

TL;DR

GloCOM tackles short-text topic modeling under data and label sparsity by constructing global clustering contexts from PLM embeddings and integrating them into a VAE-based framework. It learns both global cluster-level topic distributions $\theta^g$ and per-document local distributions $\theta^g_d$ via an adaptive factor $\rho_d$, and augments each document with its cluster’s global content using $\widetilde{x}^d = x^d + \eta x^g$ to enrich reconstruction targets. The model jointly optimizes the reconstruction objective and an Embedding Clustering Regularization term, yielding improved topic coherence, topic diversity, and document-topic distributions across multiple short-text datasets, while remaining computationally efficient relative to transport-based aggregation methods. Empirical results show that PLM-based clustering and global context augmentation provide competitive or superior performance against strong baselines, and ablations confirm the contributions of both the global context and augmentation strategy. Overall, GloCOM offers a scalable, effective approach for short-text topic modeling with meaningful improvements in both topic quality and document representations, suitable for downstream retrieval and analysis tasks.

Abstract

Uncovering hidden topics from short texts is challenging for traditional and neural models due to data sparsity, which limits word co-occurrence patterns, and label sparsity, stemming from incomplete reconstruction targets. Although data aggregation offers a potential solution, existing neural topic models often overlook it due to time complexity, poor aggregation quality, and difficulty in inferring topic proportions for individual documents. In this paper, we propose a novel model, GloCOM (Global Clustering COntexts for Topic Models), which addresses these challenges by constructing aggregated global clustering contexts for short documents, leveraging text embeddings from pre-trained language models. GloCOM can infer both global topic distributions for clustering contexts and local distributions for individual short texts. Additionally, the model incorporates these global contexts to augment the reconstruction loss, effectively handling the label sparsity issue. Extensive experiments on short text datasets show that our approach outperforms other state-of-the-art models in both topic quality and document representations.

Paper Structure

This paper contains 30 sections, 10 equations, 5 figures, 8 tables, 1 algorithm.

Figures (5)

  • Figure 1: Illustration of our short text aggregation. We cluster short texts using PLM embeddings and form a global document for each cluster by concatenating its texts. Each text is then augmented with its corresponding global document, and the resulting augmented document is used in the reconstruction loss.
  • Figure 2: The probabilistic graphical model illustrating the generative process of documents in GloCOM.
  • Figure 3: The overall architecture of GloCOM. Our method builds global and augmented documents by clustering pre-trained language model embeddings. GloCOM estimates both global and local document-topic distributions and incorporates the augmented documents into the reconstruction loss.
  • Figure 4: Clustering effectiveness of the GloCOM model with different representations for short text clustering ($K = 50$) on the SearchSnippets dataset.
  • Figure 5: The t-SNE visualization shows the topic distributions learned by various short text models.
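
The aggregation in the Figure 1 caption, together with the augmentation rule $\widetilde{x}^d = x^d + \eta x^g$ from the TL;DR, can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. It assumes bag-of-words count vectors, treats "concatenating texts" as summing the counts within a cluster, and uses a plain k-means as a stand-in for whatever clustering is applied to the PLM embeddings. The function names `kmeans_labels` and `augment_bow`, the toy embeddings, and the value of `eta` are all hypothetical.

```python
import numpy as np


def kmeans_labels(embeddings, k, n_iter=20, seed=0):
    """Plain Lloyd's k-means (assumed stand-in for clustering PLM embeddings)."""
    embeddings = np.asarray(embeddings, dtype=float)
    rng = np.random.default_rng(seed)
    # Start from k distinct documents; fancy indexing returns a copy.
    centers = embeddings[rng.choice(len(embeddings), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance from every document to every center.
        dists = ((embeddings[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = embeddings[labels == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    return labels


def augment_bow(bow, labels, eta):
    """Return (x_g per document, x_tilde) where x_tilde = x + eta * x_g."""
    bow = np.asarray(bow, dtype=float)
    labels = np.asarray(labels)
    n_clusters = int(labels.max()) + 1
    cluster_bow = np.zeros((n_clusters, bow.shape[1]))
    # Summing counts per cluster mimics concatenating the cluster's texts.
    np.add.at(cluster_bow, labels, bow)
    global_bow = cluster_bow[labels]  # each document's global document
    return global_bow, bow + eta * global_bow


# Toy data: 4 short documents over a 3-word vocabulary (hypothetical values).
toy_embeddings = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
toy_bow = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 0, 2]])

toy_labels = kmeans_labels(toy_embeddings, k=2)
toy_global, toy_augmented = augment_bow(toy_bow, toy_labels, eta=0.5)
print(toy_augmented)
```

In this sketch each document's reconstruction target gains weight on words used by its cluster neighbours, which is the intuition behind addressing label sparsity. Whether the global vector should be normalized, and how $\eta$ is chosen, are left open here.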