GloCOM: A Short Text Neural Topic Model via Global Clustering Context
Quang Duc Nguyen, Tung Nguyen, Duc Anh Nguyen, Linh Ngo Van, Sang Dinh, Thien Huu Nguyen
TL;DR
GloCOM tackles short-text topic modeling under data and label sparsity by constructing global clustering contexts from PLM embeddings and integrating them into a VAE-based framework. It learns both global cluster-level topic distributions $\theta^g$ and per-document local distributions $\theta^g_d$ via an adaptive factor $\rho_d$, and augments each document with its cluster’s global content using $\widetilde{x}^d = x^d + \eta x^g$ to enrich reconstruction targets. The model jointly optimizes the reconstruction objective and an Embedding Clustering Regularization term, yielding improved topic coherence, topic diversity, and document-topic distributions across multiple short-text datasets, while remaining computationally efficient relative to transport-based aggregation methods. Empirical results show that PLM-based clustering and global context augmentation provide competitive or superior performance against strong baselines, and ablations confirm the contributions of both the global context and augmentation strategy. Overall, GloCOM offers a scalable, effective approach for short-text topic modeling with meaningful improvements in both topic quality and document representations, suitable for downstream retrieval and analysis tasks.
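The augmentation step $\widetilde{x}^d = x^d + \eta x^g$ can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes that $x^d$ is a bag-of-words count vector, that the global context $x^g$ is the unnormalized sum of the count vectors of all documents in the same cluster, and that cluster assignments are already available (in GloCOM they would come from clustering PLM embeddings; here they are hard-coded). The function names `build_global_contexts` and `augment` are hypothetical.

```python
from collections import defaultdict


def build_global_contexts(bows, cluster_ids):
    """Sum the bag-of-words vectors of all documents that share a cluster id.

    Assumption: the global context x^g of a cluster is the plain sum of its
    members' count vectors (equivalent to concatenating the short texts).
    """
    vocab_size = len(bows[0])
    contexts = defaultdict(lambda: [0.0] * vocab_size)
    for bow, cid in zip(bows, cluster_ids):
        ctx = contexts[cid]
        for j, count in enumerate(bow):
            ctx[j] += count
    return dict(contexts)


def augment(bows, cluster_ids, eta):
    """Return the augmented targets x~^d = x^d + eta * x^g for every document."""
    contexts = build_global_contexts(bows, cluster_ids)
    return [
        [x + eta * g for x, g in zip(bow, contexts[cid])]
        for bow, cid in zip(bows, cluster_ids)
    ]


# Toy corpus: three short documents over a four-word vocabulary.
bows = [
    [1, 0, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 2],
]
# Hard-coded cluster assignments (stand-in for clustering PLM embeddings).
cluster_ids = [0, 0, 1]

augmented = augment(bows, cluster_ids, eta=0.5)
for row in augmented:
    print(row)
```

In this toy corpus the first two documents share a cluster, so each one's reconstruction target gains weight on words that appear only in the other. That is the sense in which the global context enriches sparse targets.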
Abstract
Uncovering hidden topics from short texts is challenging for traditional and neural models due to data sparsity, which limits word co-occurrence patterns, and label sparsity, stemming from incomplete reconstruction targets. Although data aggregation offers a potential solution, existing neural topic models often overlook it due to time complexity, poor aggregation quality, and difficulty in inferring topic proportions for individual documents. In this paper, we propose a novel model, GloCOM (Global Clustering COntexts for Topic Models), which addresses these challenges by constructing aggregated global clustering contexts for short documents, leveraging text embeddings from pre-trained language models. GloCOM can infer both global topic distributions for clustering contexts and local distributions for individual short texts. Additionally, the model incorporates these global contexts to augment the reconstruction loss, effectively handling the label sparsity issue. Extensive experiments on short text datasets show that our approach outperforms other state-of-the-art models in both topic quality and document representations.
