Semantic-Driven Topic Modeling Using Transformer-Based Embeddings and Clustering Algorithms

Melkamu Abay Mersha, Mesay Gemeda Yigezu, Jugal Kalita

TL;DR

This work addresses the difficulty of capturing contextual semantics in topic modeling by proposing an end-to-end, semantic-driven pipeline built on transformer-based embeddings. The method uses SBERT for document embeddings, UMAP for dimensionality reduction, and HDBSCAN for clustering. A cluster-centric topic extraction step then filters out irrelevant words using contextual similarity and ranks the remaining words by their average cosine similarity to the sentences in the cluster. On three datasets (20NewsGroups, BBC News, Trump’s tweets), the method reports higher topic coherence than existing topic models (LDA, CTM, ETM, BERTopic) and ChatGPT, which the summary takes as evidence of robustness and scalability. The approach offers a practical path to coherent, context-aware topic extraction in large corpora, and it can improve further as embedding models evolve.
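The ranking step described above can be illustrated with a minimal sketch. It is one reading of "average cosine similarity to cluster sentences", not the paper's exact equation. The function names, the `min_score` filter threshold, and the two-dimensional toy vectors are hypothetical; real vectors would come from the embedding model.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_words(word_vecs, sentence_vecs, top_k=3, min_score=0.0):
    """Score each candidate word by its mean cosine similarity to the
    sentence vectors of one cluster, drop words scoring below min_score
    (a hypothetical filter threshold), and return the top_k by score."""
    scores = {
        word: sum(cosine(vec, s) for s in sentence_vecs) / len(sentence_vecs)
        for word, vec in word_vecs.items()
    }
    kept = [(w, s) for w, s in scores.items() if s >= min_score]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)[:top_k]


# Toy vectors for illustration only (assumed, not taken from the paper).
sentence_vecs = [[1.0, 0.0], [0.8, 0.6]]
word_vecs = {
    "space": [1.0, 0.1],
    "orbit": [0.6, 0.8],
    "recipe": [-1.0, 0.2],
}

ranked = rank_words(word_vecs, sentence_vecs)
print([word for word, _ in ranked])
```

In this toy cluster, "recipe" points away from both sentence vectors, so its average similarity is negative and the filter removes it before ranking.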

Abstract

Topic modeling is a powerful technique to discover hidden topics and patterns within a collection of documents without prior knowledge. Traditional topic modeling and clustering-based techniques encounter challenges in capturing contextual semantic information. This study introduces an innovative end-to-end semantic-driven topic modeling technique for the topic extraction process, utilizing advanced word and document embeddings combined with a powerful clustering algorithm. This semantic-driven approach represents a significant advancement in topic modeling methodologies. It leverages contextual semantic information to extract coherent and meaningful topics. Specifically, our model generates document embeddings using pre-trained transformer-based language models, reduces the dimensions of the embeddings, clusters the embeddings based on semantic similarity, and generates coherent topics for each cluster. Compared to ChatGPT and traditional topic modeling algorithms, our model provides more coherent and meaningful topics.
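The first three stages named in the abstract (embed, reduce, cluster) can be sketched with the `sentence-transformers`, `umap-learn`, and `hdbscan` packages. This is an illustration under stated assumptions, not the authors' implementation. The checkpoint name `all-MiniLM-L6-v2` (a 384-dimensional SBERT model), the hyperparameter defaults, and both function names are assumptions that the source does not specify.

```python
from collections import defaultdict


def group_by_cluster(docs, labels):
    """Group documents by cluster label, dropping HDBSCAN noise (label -1)."""
    clusters = defaultdict(list)
    for doc, label in zip(docs, labels):
        if label != -1:
            clusters[int(label)].append(doc)
    return dict(clusters)


def embed_reduce_cluster(docs, model_name="all-MiniLM-L6-v2",
                         n_components=5, min_cluster_size=15):
    """Embed documents, reduce dimensionality, and cluster them.

    All default values are assumptions chosen for illustration.
    """
    # Third-party imports are local so the helper above stays usable
    # without these packages installed.
    from sentence_transformers import SentenceTransformer
    import umap
    import hdbscan

    # 1. Document embeddings from a pre-trained transformer model.
    embeddings = SentenceTransformer(model_name).encode(docs)
    # 2. Dimensionality reduction of the embeddings.
    reduced = umap.UMAP(n_components=n_components,
                        metric="cosine").fit_transform(embeddings)
    # 3. Density-based clustering; outliers receive the label -1.
    labels = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size).fit_predict(reduced)
    return group_by_cluster(docs, labels)


# The grouping helper can be exercised on its own with hand-written labels.
example = group_by_cluster(["a", "b", "c", "d"], [0, -1, 1, 0])
print(example)
```

Calling `embed_reduce_cluster(docs)` on a list of strings returns a dictionary that maps each cluster id to its documents. Topic words for each cluster would then be extracted in a separate step.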

Paper Structure

This paper contains 20 sections, 1 equation, 2 figures, 6 tables, and 1 algorithm.

Figures (2)

  • Figure 1: Overview of the proposed pipeline model architecture.
  • Figure 2: (a) UMAP reduction of 384-dimensional sentence vectors from the 20NewsGroups dataset to 2 dimensions. (b) HDBSCAN clustering of the reduced sentence vectors from the same dataset, highlighting dense regions of semantically similar sentences. Scattered red points mark sentences labeled as noise or outliers. (c) The dense regions of semantically similar sentences identified by HDBSCAN, with outlier sentences removed to show its noise-removal capability.