Bridging the Evaluation Gap: Leveraging Large Language Models for Topic Model Evaluation

Zhiyin Tan, Jennifer D'Souza

TL;DR

This work addresses the challenge of evaluating dynamic topic taxonomies in growing scientific corpora by proposing an automated framework that uses Large Language Models (LLMs) as evaluators. It defines four complementary metrics (topic-words coherence, repetitiveness, topic diversity, and topic-document alignment), each with tailored prompts and adversarial tests to ensure robustness and interpretability. The authors validate the framework on 20NG and AGRIS across four topic models (LDA, ProdLDA, CombinedTM, BERTopic), demonstrating scalability and the ability to reveal evaluator-specific biases. The approach offers a more holistic, dynamic evaluation that can inform model selection and downstream tasks, with implications for improved literature discovery and research analytics.

Abstract

This study presents a framework for automated evaluation of dynamically evolving topic taxonomies in scientific literature using Large Language Models (LLMs). In digital library systems, topic modeling plays a crucial role in efficiently organizing and retrieving scholarly content, guiding researchers through complex knowledge landscapes. As research domains proliferate and shift, traditional human-centric and static evaluation methods struggle to maintain relevance. The proposed approach harnesses LLMs to measure key quality dimensions, such as coherence, repetitiveness, diversity, and topic-document alignment, without heavy reliance on expert annotators or narrow statistical metrics. Tailored prompts guide LLM assessments, ensuring consistent and interpretable evaluations across various datasets and modeling techniques. Experiments on benchmark corpora demonstrate the method's robustness, scalability, and adaptability, underscoring its value as a more holistic and dynamic alternative to conventional evaluation strategies.

Paper Structure

This paper contains 38 sections, 1 equation, 2 figures, and 9 tables.

Figures (2)

  • Figure 1: Radar plots comparing the evaluation trends of three LLMs on the 20NG results ($K=50$)
  • Figure 2: Radar plot comparing the evaluation trends of three LLMs on the AGRIS results ($K=50$)