TopicTag: Automatic Annotation of NMF Topic Models Using Chain of Thought and Prompt Tuning with LLMs
Selma Wanna, Ryan Barron, Nick Solovyev, Maksim E. Eren, Manish Bhattarai, Kim Rasmussen, Boian S. Alexandrov
TL;DR
This work addresses automatic topic labeling for NMFk-derived topic clusters by combining NMFk outputs with prompt-tuned LLMs, using Chain-of-Thought prompting and Optuna-driven prompt optimization. The approach is validated on over 34,000 scientific abstracts on Knowledge Graphs. It shows that a smaller model such as Meta-Llama-3-8B-Instruct can reach subject matter expert (SME)-level labeling accuracy (average 3.78/5) after iterative prompting, while broader model scaling requires more optimization rounds. Key contributions include a two-stage prompt-filtering pipeline that uses document features from NMFk, BERTScore-based pruning, and SME feedback to produce high-quality labels with reduced SME effort. The findings point to the potential for generalizing automated topic labeling across domains, and to future directions in embedding-based task representations and RLHF-driven improvements.
Abstract
Topic modeling is a technique for organizing and extracting themes from large collections of unstructured text. Non-negative matrix factorization (NMF) is a common unsupervised approach that decomposes a term frequency-inverse document frequency (TF-IDF) matrix to uncover latent topics and segment the dataset accordingly. While useful for highlighting patterns and clustering documents, NMF does not provide explicit topic labels, necessitating subject matter experts (SMEs) to assign labels manually. We present a methodology for automating topic labeling in documents clustered via NMF with automatic model determination (NMFk). By leveraging the output of NMFk and employing prompt engineering, we utilize large language models (LLMs) to generate accurate topic labels. Our case study on over 34,000 scientific abstracts on Knowledge Graphs demonstrates the effectiveness of our method in enhancing knowledge management and document organization.
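The decomposition described above can be illustrated with a minimal, standard-library-only sketch. It is not the authors' NMFk implementation: the number of topics `k` is fixed by hand rather than determined automatically, the TF-IDF weighting is one common smoothed variant, the factorization uses plain multiplicative updates, and the toy documents, helper names, and prompt wording are all hypothetical.

```python
import math
import random
from collections import Counter


def tfidf_matrix(docs):
    """Return (vocab, X), where X[d][t] is a smoothed TF-IDF weight."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    X = []
    for toks in tokenized:
        counts = Counter(toks)
        X.append([counts[t] * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1.0)
                  for t in vocab])
    return vocab, X


def transpose(A):
    return [list(row) for row in zip(*A)]


def matmul(A, B):
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def frobenius_error(X, W, H):
    WH = matmul(W, H)
    return math.sqrt(sum((x - y) ** 2
                         for rx, ry in zip(X, WH) for x, y in zip(rx, ry)))


def nmf(X, k, n_iter=200, seed=0, eps=1e-9):
    """Approximate X by W * H with non-negative W (docs x k) and H (k x terms)."""
    rng = random.Random(seed)
    n, m = len(X), len(X[0])
    W = [[rng.random() + eps for _ in range(k)] for _ in range(n)]
    H = [[rng.random() + eps for _ in range(m)] for _ in range(k)]
    for _ in range(n_iter):
        # H <- H * (W^T X) / (W^T W H), element-wise
        Wt = transpose(W)
        num, den = matmul(Wt, X), matmul(matmul(Wt, W), H)
        H = [[H[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(k)]
        # W <- W * (X H^T) / (W H H^T), element-wise
        Ht = transpose(H)
        num, den = matmul(X, Ht), matmul(W, matmul(H, Ht))
        W = [[W[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)]
             for i in range(n)]
    return W, H


def top_words(H, vocab, n=3):
    """Pick the n highest-weighted terms for each topic (row of H)."""
    return [[vocab[j] for j in sorted(range(len(vocab)), key=lambda j: -row[j])[:n]]
            for row in H]


def build_prompt(words):
    # Hypothetical prompt wording, not the prompt used in the paper.
    return ("The following keywords describe one topic: " + ", ".join(words)
            + ". Think step by step about what they share, "
            "then give a short topic label.")


# Toy corpus (hypothetical); the paper uses over 34,000 abstracts.
docs = [
    "knowledge graph embedding link prediction",
    "graph embedding entity link prediction",
    "question answering over knowledge graph",
    "natural language question answering system",
]
vocab, X = tfidf_matrix(docs)
W, H = nmf(X, k=2)
topics = top_words(H, vocab)
prompt = build_prompt(topics[0])
print(prompt)
```

Each row of `H` weights the vocabulary for one latent topic, and each row of `W` gives a document's loading on those topics, so the largest entry in a row of `W` can serve as that document's cluster assignment. The top-weighted terms are the kind of NMF output that could be passed to an LLM as context for labeling.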
