Efficient Topic Extraction via Graph-Based Labeling: A Lightweight Alternative to Deep Models

Salma Mekaoui, Hiba Sofyan, Imane Amaaz, Imane Benchrif, Arsalane Zarghili, Ilham Chaker, Nikola S. Nikolov

TL;DR

This work tackles the challenge of producing human-interpretable topic labels from topic modeling outputs without heavy neural models. It proposes two lightweight labelers: Direct Similarity Labeling (DSL), which scores each topic word $w_i$ by $S_i = \cos(E_{\text{topic}}, E_{w_i})$, and Graph-Enhanced Labeling (GEL), which leverages a ConceptNet-based graph expanded up to three hops to enrich candidate labels and compute node embeddings for comparison with the topic embedding $E_{\text{topic}}$. Here $E_{\text{topic}}$ is the embedding of the topic words treated as a single sentence, and $S_j = \cos(E_{\text{topic}}, E_j)$ is the cosine similarity between the topic embedding and the embedding $E_j$ of candidate node $j$. Evaluations on Topic_Bhatia and 20 Newsgroups show that DSL and GEL achieve high semantic alignment (measured by $S_j$ or BERTScore) and often outperform pretrained topic labeling (TL) models like BART-TL, with GEL frequently delivering the best results and strong generalizability. The results demonstrate that simple, interpretable graph-based labeling can rival or surpass costly neural approaches, offering a practical, scalable solution for topic labeling and interpretability in real-world NLP tasks. Future work will explore additional graphs and representation techniques.
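The direct similarity score described above can be illustrated with a minimal sketch. It assumes that the label is simply the topic word with the highest cosine similarity to the topic embedding, which the summary does not state explicitly. The function names (`cosine`, `direct_similarity_label`) and the three-dimensional toy vectors are hypothetical stand-ins; in practice the embeddings would come from a sentence-embedding model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Guard against zero vectors, for which the cosine is undefined.
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def direct_similarity_label(topic_embedding, word_embeddings):
    """Score every word with S_i = cos(E_topic, E_{w_i}) and return the best one.

    Assumption: the highest-scoring word is taken as the label.
    """
    scores = {w: cosine(topic_embedding, e) for w, e in word_embeddings.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Toy vectors (hypothetical; real embeddings have hundreds of dimensions).
toy_word_embeddings = {
    "server": [0.9, 0.1, 0.0],
    "virtualization": [0.7, 0.6, 0.1],
    "banana": [0.0, 0.1, 0.9],
}
# Hand-picked stand-in for E_topic, the embedding of the topic words
# treated as a single sentence.
toy_topic_embedding = [0.8, 0.4, 0.05]

label, scores = direct_similarity_label(toy_topic_embedding, toy_word_embeddings)
print(label, {w: round(s, 3) for w, s in scores.items()})
```

The same scoring function applies unchanged to graph nodes: replacing the topic words with candidate nodes gives $S_j$ instead of $S_i$.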

Abstract

Extracting topics from text has become an essential task, especially with the rapid growth of unstructured textual data. Most existing works rely on computationally intensive methods to address this challenge. In this paper, we argue that probabilistic and statistical approaches, such as topic modeling (TM), can offer effective alternatives that require fewer computational resources. TM is a statistical method that automatically discovers topics in large collections of unlabeled text; however, it produces topics as distributions of representative words, which often lack clear interpretability. Our objective is to perform topic labeling by assigning meaningful labels to these sets of words. To achieve this without relying on computationally expensive models, we propose a graph-based approach that not only enriches topic words with semantically related terms but also explores the relationships among them. By analyzing these connections within the graph, we derive suitable labels that accurately capture each topic's meaning. We present a comparative study between our proposed method and several benchmarks, including ChatGPT-3.5, across two different datasets. Our method achieved consistently better results than traditional benchmarks in terms of BERTScore and cosine similarity and produced results comparable to ChatGPT-3.5, while remaining computationally efficient. Finally, we discuss future directions for topic labeling and highlight potential research avenues for enhancing interpretability and automation.
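The enrichment step of the graph-based approach can be sketched as a bounded breadth-first expansion from the topic words. This is an illustration under stated assumptions, not the authors' implementation: the adjacency dictionary `toy_neighbors` is a hypothetical stand-in for ConceptNet lookups, `expand_graph` is an illustrative name, and the three-hop limit follows the TL;DR above.

```python
from collections import deque


def expand_graph(seeds, neighbors, max_hops=3):
    """Breadth-first expansion from the seed words, up to max_hops edges away.

    Returns a dict mapping every reached node to its hop distance
    from the nearest seed.
    """
    hops = {s: 0 for s in seeds}
    queue = deque(seeds)
    while queue:
        node = queue.popleft()
        if hops[node] == max_hops:
            continue  # do not expand past the hop limit
        for nxt in neighbors.get(node, ()):
            if nxt not in hops:
                hops[nxt] = hops[node] + 1
                queue.append(nxt)
    return hops


# Hypothetical toy graph standing in for ConceptNet relations.
toy_neighbors = {
    "server": ["computer", "network"],
    "virtualization": ["virtual_machine"],
    "virtual_machine": ["computer"],
    "computer": ["technology"],
    "technology": ["science"],
    "science": ["knowledge"],
}

candidates = expand_graph(["server", "virtualization"], toy_neighbors)
print(candidates)
```

Every node in `candidates` would then be a candidate label to embed and compare against the topic embedding. In this toy graph, "knowledge" is four hops from the nearest seed and is therefore never reached.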

Paper Structure

This paper contains 23 sections, 3 equations, 3 figures, 5 tables, 2 algorithms.

Figures (3)

  • Figure 1: Visual schema of the proposed methodology for TL, illustrating the main steps from input topic words to the selection of the final representative label.
  • Figure 2: Visualization of how topic words (e.g., “server”, “infrastructure”, “virtualization”, and “virtual”) become interconnected within ConceptNet after iterative expansion, resulting in a well-connected graph.
  • Figure 3: Cosine similarity scores for the adopted method on the 20 Newsgroups dataset. The red dashed line represents the benchmark (ChatGPT, 10 words).