TopiCLEAR: Topic extraction by CLustering Embeddings with Adaptive dimensional Reduction

Aoi Fujita, Taichi Yamamoto, Yuri Nakayama, Ryota Kobayashi

TL;DR

TopiCLEAR tackles the challenge of topic extraction from short social-media texts by embedding documents with SBERT and applying an adaptive dimension reduction framework that iteratively refines clustering via a linear discriminant projection and GMM. The method requires no preprocessing and is evaluated on four labeled datasets, where it achieves higher ARI/AMI alignment with human annotations than seven baselines, including embedding-based and language-model approaches. Qualitative analysis on TweetTopic shows interpretable, human-aligned topics, underscoring TopiCLEAR's practicality for social media analytics and large-scale web content understanding. Overall, the work demonstrates that combining contextual embeddings with adaptive, projection-based clustering yields robust, scalable topic extraction for short and informal text.

Abstract

The rapid expansion of social media platforms such as X (formerly Twitter), Facebook, and Reddit has enabled large-scale analysis of public perceptions on diverse topics, including social issues, politics, natural disasters, and consumer sentiment. Topic modeling is a widely used approach for uncovering latent themes in text data, typically framed as an unsupervised classification task. However, traditional models, originally designed for longer and more formal documents, struggle with short social media posts due to limited co-occurrence statistics, fragmented semantics, inconsistent spelling, and informal language. To address these challenges, we propose a new method, TopiCLEAR: Topic extraction by CLustering Embeddings with Adaptive dimensional Reduction. Specifically, each text is embedded using Sentence-BERT (SBERT) and provisionally clustered using Gaussian Mixture Models (GMM). The clusters are then refined iteratively: the provisional cluster labels supervise a projection based on linear discriminant analysis, and GMM-based clustering is repeated in the projected space until convergence. Notably, our method operates directly on raw text, eliminating the need for preprocessing steps such as stop word removal. We evaluate our approach on four diverse datasets, 20News, AgNewsTitle, Reddit, and TweetTopic, each containing human-labeled topic information. Compared with seven baseline methods, including a recent SBERT-based method and a zero-shot generative AI method, our approach achieves the highest similarity to human-annotated topics, with significant improvements for both social media posts and online news articles. Additionally, qualitative analysis shows that our method produces more interpretable topics, highlighting its potential for applications in social media and web content analytics.
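
The loop described in the abstract (provisional GMM clustering, then repeated discriminant projection and re-clustering) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes scikit-learn, replaces SBERT embeddings with synthetic vectors so that it runs without downloading a model, and makes several choices the abstract does not specify: the function name `refine_clusters`, the projection dimension (number of clusters minus one), the diagonal covariance for the first GMM, and the stopping rule (labels essentially unchanged between iterations).

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import adjusted_rand_score
from sklearn.mixture import GaussianMixture


def refine_clusters(embeddings, n_topics, max_iter=20, seed=0):
    """Alternate an LDA projection and GMM clustering until labels stabilize.

    Hypothetical helper: the stopping rule and the projection dimension
    are assumptions, not details taken from the paper.
    """
    # Provisional clustering in the original embedding space.
    gmm = GaussianMixture(n_components=n_topics, covariance_type="diag",
                          random_state=seed)
    labels = gmm.fit_predict(embeddings)

    for _ in range(max_iter):
        n_found = len(np.unique(labels))
        if n_found < 2:
            break  # LDA needs at least two classes to define a projection
        # Use the current cluster labels as supervision for the projection.
        n_dims = min(n_found - 1, embeddings.shape[1])
        lda = LinearDiscriminantAnalysis(n_components=n_dims)
        projected = lda.fit_transform(embeddings, labels)

        # Re-cluster in the low-dimensional projected space.
        gmm = GaussianMixture(n_components=n_topics, random_state=seed)
        new_labels = gmm.fit_predict(projected)

        # Assumed convergence test: the partition no longer changes.
        converged = adjusted_rand_score(labels, new_labels) > 0.999
        labels = new_labels
        if converged:
            break
    return labels


# Synthetic stand-in for sentence embeddings: three separated blobs.
rng = np.random.default_rng(0)
centers = rng.normal(scale=20.0, size=(3, 16))
truth = np.repeat(np.arange(3), 100)
embeddings = centers[truth] + rng.normal(size=(300, 16))

labels = refine_clusters(embeddings, n_topics=3)
print("ARI vs. synthetic ground truth:", adjusted_rand_score(truth, labels))
```

In a real pipeline the `embeddings` array would hold one SBERT vector per raw document. The synthetic blobs here only show that the loop runs and terminates.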

Paper Structure

This paper contains 23 sections, 7 equations, 4 figures, 6 tables, 1 algorithm.

Figures (4)

  • Figure 1: Effect of label noise on the evaluation measures for topic extraction. Five evaluation measures ($C_{\rm UCI}$, $C_{\rm NPMI}$, $C_v$, ARI, and AMI) were calculated from data obtained by adding label noise to the ground truth data. The noise level $p_n$ was increased from 0 to 0.8. We plotted the average values from 40 experiments. Note that the UCI coherence values were not plotted in panels (a) and (b) because they were too small.
  • Figure 2: Dependence of the ARI score on document length, i.e., word count. Two datasets, (a) 20News and (b) Reddit, were examined.
  • Figure 3: Composition ratio of human-annotated topics for topics extracted by (a) TopiCLEAR and (b) LDA from the TweetTopic dataset.
  • Figure 4: Dependence of the AMI score on document length, i.e., word count. Two datasets, (a) 20News and (b) Reddit, were examined.
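
The label-noise experiment in Figure 1 can be illustrated for ARI with a short standard-library sketch. The noise model is an assumption, since the caption does not define it: each label is replaced, with probability $p_n$, by a class drawn uniformly at random. The helper names `adjusted_rand_index` and `add_label_noise`, the four balanced classes, and the 2000 synthetic labels are likewise hypothetical. Only the noise range (0 to 0.8) and the averaging over 40 runs follow the caption.

```python
import random
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index computed from the contingency counts."""
    n = len(labels_a)
    # Number of co-clustered pairs in both partitions, and in each one.
    index = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0  # degenerate case: both partitions are trivial
    return (index - expected) / (max_index - expected)


def add_label_noise(labels, noise_level, n_classes, rng):
    """Assumed noise model: resample a label uniformly with prob. noise_level."""
    return [rng.randrange(n_classes) if rng.random() < noise_level else y
            for y in labels]


rng = random.Random(0)
truth = [i % 4 for i in range(2000)]  # synthetic ground truth, 4 classes

for p_n in (0.0, 0.2, 0.4, 0.6, 0.8):
    scores = [adjusted_rand_index(truth, add_label_noise(truth, p_n, 4, rng))
              for _ in range(40)]
    print(f"p_n={p_n:.1f}  mean ARI={sum(scores) / len(scores):.3f}")
```

A measure that tracks agreement with human annotation should decrease steadily as $p_n$ grows. Running the sketch shows how ARI behaves under this particular noise model.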