Towards the TopMost: A Topic Modeling System Toolkit

Xiaobao Wu, Fengjun Pan, Anh Tuan Luu

TL;DR

The paper targets fragmentation in topic-modeling toolchains by introducing TopMost, a unified toolkit that standardizes datasets, preprocessing, models, training, and evaluation. Implemented in Python with PyTorch, it uses a decoupled, modular architecture to support rapid experimentation and fair model comparisons across multiple scenarios. It covers basic, hierarchical, dynamic, and cross-lingual topic modeling, integrating a broad set of neural and conventional models, extensive datasets, and diverse evaluation metrics along with visualization tools. By providing tutorials, code examples, and a web-based visualization interface, TopMost aims to accelerate research and practical applications, while acknowledging limitations such as the absence of perplexity-focused metrics and of support for prompting large language models.

Abstract

Topic models have a rich history with various applications and have recently been reinvigorated by neural topic modeling. However, these numerous topic models adopt totally distinct datasets, implementations, and evaluations. This impedes quick utilization and fair comparisons, and thereby hinders their research progress and applications. To tackle this challenge, we in this paper propose a Topic Modeling System Toolkit (TopMost). Compared to existing toolkits, TopMost stands out by supporting more extensive features. It covers a broader spectrum of topic modeling scenarios with their complete lifecycles, including datasets, preprocessing, models, training, and evaluations. Thanks to its highly cohesive and decoupled modular design, TopMost enables rapid utilization, fair comparisons, and flexible extensions of diverse cutting-edge topic models. Our code, tutorials, and documentation are available at https://github.com/bobxwu/topmost.

Paper Structure (15 sections, 7 figures, 5 tables)

Figures (7)

  • Figure 1: Comparison of neural topic models in OCTIS and TopMost. TopMost covers more of the latest neural topic models than OCTIS.
  • Figure 2: Overall architecture of TopMost. It covers the most common topic modeling scenarios and decouples data loading, model construction, model training, and evaluation in the topic modeling lifecycle.
  • Figure 3: A code example for quick start.
  • Figure 4: A code example for training a topic model (ProdLDA; Srivastava and Sutton, 2017).
  • Figure 5: A code example for evaluating a topic model, including topic coherence, topic diversity, text classification, and clustering.
  • ...and 2 more figures
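
Figure 5 lists topic diversity among the evaluation metrics. As a rough illustration of what such a metric measures, the sketch below computes one common definition: the fraction of unique words among the top words of all topics. It is not TopMost's implementation; the function name `topic_diversity` and the example topics are hypothetical and chosen only for illustration.

```python
from typing import List


def topic_diversity(top_words: List[List[str]]) -> float:
    """Return the fraction of unique words among the top words of all topics.

    A value of 1.0 means no word is shared between topics; lower values
    mean the topics repeat each other's top words.
    """
    all_words = [word for topic in top_words for word in topic]
    if not all_words:
        raise ValueError("top_words must contain at least one word")
    return len(set(all_words)) / len(all_words)


# Hypothetical top-3 words for three topics (illustrative data only).
example_topics = [
    ["game", "team", "season"],
    ["market", "stock", "price"],
    ["game", "player", "price"],
]

score = topic_diversity(example_topics)
print(f"topic diversity: {score:.3f}")
```

Here "game" and "price" each appear in two topics, so 7 of the 9 top words are unique and the score falls below 1.0. A toolkit that standardizes such metrics lets the same computation be applied to every model under comparison.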