TopicTag: Automatic Annotation of NMF Topic Models Using Chain of Thought and Prompt Tuning with LLMs

Selma Wanna, Ryan Barron, Nick Solovyev, Maksim E. Eren, Manish Bhattarai, Kim Rasmussen, Boian S. Alexandrov

TL;DR

This work tackles automatic topic labeling for NMFk-derived topic clusters by combining NMFk outputs with prompt-tuned LLMs through Chain-of-Thought prompting and Optuna-driven prompt optimization. The approach, validated on over 34,000 Knowledge Graph abstracts, demonstrates that a smaller model such as Meta-Llama-3-8B-Instruct can achieve SME-level labeling accuracy (average 3.78/5) after iterative prompting, while broader model scaling requires more optimization rounds. Key contributions include a two-stage prompt-filtering pipeline that leverages document features from NMFk, BERTScore-based pruning, and SME feedback to produce high-quality labels with reduced SME effort. The findings highlight the potential for generalizing automated topic labeling across domains and point to future directions in embedding-based task representations and RLHF-driven improvements.

Abstract

Topic modeling is a technique for organizing and extracting themes from large collections of unstructured text. Non-negative matrix factorization (NMF) is a common unsupervised approach that decomposes a term frequency-inverse document frequency (TF-IDF) matrix to uncover latent topics and segment the dataset accordingly. While useful for highlighting patterns and clustering documents, NMF does not provide explicit topic labels, necessitating subject matter experts (SMEs) to assign labels manually. We present a methodology for automating topic labeling in documents clustered via NMF with automatic model determination (NMFk). By leveraging the output of NMFk and employing prompt engineering, we utilize large language models (LLMs) to generate accurate topic labels. Our case study on over 34,000 scientific abstracts on Knowledge Graphs demonstrates the effectiveness of our method in enhancing knowledge management and document organization.
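To make the decomposition step in the abstract concrete, the following is a minimal sketch of plain NMF on a TF-IDF matrix, written in pure Python with Lee-Seung multiplicative updates. It is not the authors' implementation: NMFk additionally selects the number of topics k automatically, whereas here k is fixed by hand. The toy corpus and the helper names (tfidf, nmf, top_terms, frobenius_error) are assumptions made for illustration only.

```python
import math
import random


def tfidf(docs):
    """Return (vocab, rows); rows[i][j] is the TF-IDF weight of term j in doc i."""
    tokenized = [d.lower().split() for d in docs]
    vocab = sorted({t for doc in tokenized for t in doc})
    n = len(docs)
    df = {t: sum(t in doc for doc in tokenized) for t in vocab}
    # Smoothed IDF keeps every weight strictly positive.
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}
    rows = [[doc.count(t) / len(doc) * idf[t] for t in vocab] for doc in tokenized]
    return vocab, rows


def transpose(a):
    return [list(r) for r in zip(*a)]


def matmul(a, b):
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def frobenius_error(x, w, h):
    """Frobenius norm of X - WH."""
    wh = matmul(w, h)
    return math.sqrt(sum((xv - av) ** 2
                         for xr, ar in zip(x, wh) for xv, av in zip(xr, ar)))


def nmf(x, k, iters=200, seed=0, eps=1e-9):
    """Factor X (docs x terms) into non-negative W (docs x k) and H (k x terms)."""
    rng = random.Random(seed)
    n, m = len(x), len(x[0])
    w = [[rng.random() for _ in range(k)] for _ in range(n)]
    h = [[rng.random() for _ in range(m)] for _ in range(k)]
    for _ in range(iters):
        # H <- H * (W^T X) / (W^T W H)
        wt = transpose(w)
        num, den = matmul(wt, x), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(k)]
        # W <- W * (X H^T) / (W H H^T)
        ht = transpose(h)
        num, den = matmul(x, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)]
             for i in range(n)]
    return w, h


def top_terms(h, vocab, n=3):
    """Highest-weighted vocabulary terms for each topic row of H."""
    return [[vocab[j] for j in sorted(range(len(vocab)), key=lambda j: -row[j])[:n]]
            for row in h]


# Toy corpus (invented for illustration, not from the paper's dataset).
docs = [
    "knowledge graph embedding link prediction",
    "graph embedding entity link prediction",
    "topic model matrix factorization text",
    "matrix factorization topic text clustering",
]
vocab, X = tfidf(docs)
W, H = nmf(X, k=2)
print(top_terms(H, vocab))
```

Each row of H ranks vocabulary terms for one latent topic, and each row of W gives a document's weight on each topic. The factors carry no human-readable name, which is the gap that topic labeling is meant to fill.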

Paper Structure (12 sections, 2 figures, 2 tables)

Figures (2)

  • Figure 1: The TopicTag pipeline. Stage 1 (left) illustrates the prompt-optimization framework applied to our training set of topic clusters. Documents are processed by the NMFk algorithm to generate feature information, which is then integrated into prompts. These prompts are evaluated by LLMs, with their label predictions compared against ground truth labels. Prompts are refined by maximizing NLG or human rater scores. In Stage 2, the optimal prompts are assessed on the test set.
  • Figure 2: The average performance of each tested LLM, marginalized over our test set, prompt templates, document features, and LLM hyperparameter configurations. We believe these variations, together with the inter-annotator agreement (IAA) reported in the correlation-study table (tab:CORR-STUDY), account for the reported variance.
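
The two-stage flow described in the Figure 1 caption can be sketched roughly as follows. This is a toy illustration under several assumptions, not the paper's pipeline: toy_llm is a hypothetical deterministic stand-in for a real model call, token_f1 is a crude token-overlap substitute for BERTScore or human ratings, the exhaustive search over two templates replaces the Optuna-driven optimization, and the clusters, keywords, and labels are invented.

```python
from collections import Counter


def token_f1(prediction, reference):
    """Token-overlap F1 between two labels (assumed stand-in for an NLG metric)."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def build_prompt(template, top_words):
    """Insert a cluster's top words (e.g. from an NMF topic) into a template."""
    return template.format(keywords=", ".join(top_words))


def toy_llm(prompt):
    """Hypothetical LLM stub: echoes keywords, truncating if asked for two words."""
    keywords = prompt.split("Keywords: ")[1].split(".")[0].split(", ")
    limit = 2 if "two-word" in prompt else len(keywords)
    return " ".join(keywords[:limit])


def mean_score(template, clusters, llm):
    """Average label score of one prompt template over a set of clusters."""
    scores = [token_f1(llm(build_prompt(template, c["top_words"])), c["label"])
              for c in clusters]
    return sum(scores) / len(scores)


def select_prompt(templates, train_clusters, llm):
    """Stage 1: keep the template with the best score on the training clusters."""
    return max(templates, key=lambda t: mean_score(t, train_clusters, llm))


templates = [
    "Keywords: {keywords}. Give a topic label.",
    "Keywords: {keywords}. Think step by step, then give a two-word topic label.",
]
train_clusters = [
    {"top_words": ["graph", "embedding", "link", "prediction"],
     "label": "graph embedding"},
    {"top_words": ["entity", "alignment", "ontology", "matching"],
     "label": "entity alignment"},
]
test_clusters = [
    {"top_words": ["question", "answering", "knowledge", "base"],
     "label": "question answering"},
]

best = select_prompt(templates, train_clusters, toy_llm)
# Stage 2: assess the selected template on held-out clusters.
test_score = mean_score(best, test_clusters, toy_llm)
print(best)
print(test_score)
```

Passing the model as a callable keeps the selection logic independent of any particular LLM client, so the stub could be swapped for a real call without changing select_prompt or mean_score.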