Using LLM-Based Approaches to Enhance and Automate Topic Labeling
Trishia Khandelwal
TL;DR
This work investigates automating topic labeling by leveraging Large Language Models (LLMs) to convert BERTopic-derived topic keywords and document summaries into concise, context-rich labels. It introduces four labeling approaches that vary in how they sample and emphasize documents and subtopics, coupled with a novel semantic-representativeness metric based on Sentence-BERT embeddings and cosine similarity. Evaluations on BBC News and 20 Newsgroups show dataset-dependent performance, with Approach 3 generally yielding strong labels and Approach 2 performing well on datasets whose categories overlap more. The study highlights the potential of LLMs to enhance topic interpretability, while outlining limitations and directions for broader validation and metric refinement.
Abstract
Topic modeling has become a crucial method for analyzing text data, particularly for extracting meaningful insights from large collections of documents. However, the output of these models typically consists of lists of keywords that must be interpreted manually before a topic can be labeled precisely. This study explores the use of Large Language Models (LLMs) to automate and enhance topic labeling by generating more meaningful and contextually appropriate labels. After applying BERTopic for topic modeling, we explore different approaches to selecting keywords and document summaries within each topic, which are then fed into an LLM to generate labels. Each approach prioritizes a different aspect, such as dominant themes or diversity, so that its effect on label quality can be assessed. Additionally, recognizing the lack of quantitative methods for evaluating topic labels, we propose a novel metric that measures how semantically representative a label is of all documents within a topic.
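To make the idea of a semantic-representativeness score concrete, the following is a minimal sketch and not the paper's exact formulation. It assumes the label and every document in the topic have already been embedded (for example with a Sentence-BERT model) and that the score is the mean cosine similarity between the label vector and the document vectors. The averaging step and the function names `cosine_similarity` and `label_representativeness` are assumptions introduced here for illustration.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def label_representativeness(
    label_vec: Sequence[float],
    doc_vecs: Sequence[Sequence[float]],
) -> float:
    """Mean cosine similarity between a label embedding and all document embeddings.

    Assumption: the mean is used as the aggregate; the source only states that
    the metric relies on Sentence-BERT embeddings and cosine similarity.
    """
    if not doc_vecs:
        raise ValueError("a topic must contain at least one document embedding")
    total = sum(cosine_similarity(label_vec, doc) for doc in doc_vecs)
    return total / len(doc_vecs)


if __name__ == "__main__":
    # Toy 2-dimensional vectors standing in for real sentence embeddings.
    label = [1.0, 0.0]
    documents = [[1.0, 0.0], [0.0, 1.0]]
    print(label_representativeness(label, documents))
```

Under this reading, a label whose embedding lies close to most documents in the topic scores near 1, while a label that matches only a few documents is pulled toward 0, which allows candidate labels for the same topic to be compared numerically.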
