LLM Reading Tea Leaves: Automatically Evaluating Topic Models with Large Language Models

Xiaohao Yang, He Zhao, Dinh Phung, Wray Buntine, Lan Du

TL;DR

This work introduces WALM, a joint evaluation framework for topic models that assesses document representations and topics together by comparing keywords generated by an LLM with the topical words produced by the topic model. Keywords are obtained through prompting (including topic-aware prompts), and their agreement with the model's output is measured with several score functions, including overlap-based and embedding/transport-based metrics. In experiments on 20News and DBpedia across seven models, WALM aligns with human judgments and complements perplexity, coherence, and downstream-task evaluation. The paper also covers an extension with contextualized embeddings, sensitivity analyses, and an open-source software package.

Abstract

Topic modeling has been a widely used tool for unsupervised text analysis. However, comprehensive evaluations of a topic model remain challenging. Existing evaluation methods are either less comparable across different models (e.g., perplexity) or focus on only one specific aspect of a model (e.g., topic quality or document representation quality) at a time, which is insufficient to reflect the overall model performance. In this paper, we propose WALM (Word Agreement with Language Model), a new evaluation method for topic modeling that considers the semantic quality of document representations and topics in a joint manner, leveraging the power of Large Language Models (LLMs). With extensive experiments involving different types of topic models, WALM is shown to align with human judgment and can serve as a complementary evaluation method to the existing ones, bringing a new perspective to topic modeling. Our software package is available at https://github.com/Xiaohao-Yang/Topic_Model_Evaluation.
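
To make the idea of word agreement concrete, here is a minimal sketch in Python. It is not the authors' implementation. It assumes, for illustration only, that a document's topical words are the top-n words obtained by mixing the topic-word distributions with the document's topic weights, and that agreement is a Jaccard-style set overlap. The function names `top_words` and `overlap_score` and the toy data are hypothetical; the paper's actual score functions are defined in its method section and software package.

```python
from typing import List, Sequence


def top_words(
    doc_topic: Sequence[float],
    topic_word: Sequence[Sequence[float]],
    vocab: Sequence[str],
    n: int = 5,
) -> List[str]:
    """Return the n highest-weighted words for one document.

    Assumption: the word weight is the topic-weighted mixture
    sum_k doc_topic[k] * topic_word[k][v].
    """
    word_scores = [
        sum(theta_k * topic[v] for theta_k, topic in zip(doc_topic, topic_word))
        for v in range(len(vocab))
    ]
    ranked = sorted(range(len(vocab)), key=lambda v: word_scores[v], reverse=True)
    return [vocab[v] for v in ranked[:n]]


def overlap_score(model_words: Sequence[str], llm_keywords: Sequence[str]) -> float:
    """Jaccard-style overlap between two word lists (illustrative choice)."""
    a = {w.lower() for w in model_words}
    b = {w.lower() for w in llm_keywords}
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Toy example with two topics (sports, space) over a six-word vocabulary.
vocab = ["game", "team", "space", "orbit", "nasa", "season"]
topic_word = [
    [0.4, 0.3, 0.0, 0.0, 0.0, 0.3],  # sports topic
    [0.0, 0.0, 0.4, 0.3, 0.3, 0.0],  # space topic
]
doc_topic = [0.1, 0.9]  # the document is mostly about space

model_words = top_words(doc_topic, topic_word, vocab, n=3)
llm_keywords = ["NASA", "space", "launch"]  # stand-in for an LLM's answer
score = overlap_score(model_words, llm_keywords)
```

In this toy case the two word lists share two of four distinct words. An exact-match overlap like this gives no credit for synonyms, which is one reason the TL;DR also lists embedding/transport-based scores.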

Paper Structure (33 sections, 13 equations, 10 figures, 2 tables)

Figures (10)

  • Figure 1: Performance rankings of topic quality (NPMI) and document representation quality (ACC) during model selection. The best model state/checkpoint can be determined using either NPMI or ACC as the selection criterion. However, it can be observed that the rankings for topic quality and document representation quality are inconsistent under the same selection criteria. Experiments are conducted five times, with the number of topics set to 50.
  • Figure 2: An example prompt and output for keyword suggestion by the LLM. In this example, the number of keywords (i.e., N) is set to 5.
  • Figure 3: An illustration of the topic-aware keyword suggestion pipeline. The words highlighted in green represent collection-level topics generated by the LLM. Each topic selected in stage 1 is used in the stage 2 prompt to generate topic-aware keywords.
  • Figure 4: Topic models' performance in terms of WALM with keyword suggestion by the LLM on 20News (top row) and DBpedia (bottom row). Error bars represent the standard deviation (omitted for values smaller than the symbol size).
  • Figure 5: Topic models' performance in terms of WALM with topic-aware keyword suggestion by the LLM on 20News (top row) and DBpedia (bottom row). Error bars represent the standard deviation (omitted for values smaller than the symbol size).
  • ...and 5 more figures