Towards the TopMost: A Topic Modeling System Toolkit
Xiaobao Wu, Fengjun Pan, Anh Tuan Luu
TL;DR
The paper targets fragmentation in topic-modeling toolchains by introducing TopMost, a unified toolkit that standardizes datasets, preprocessing, models, training, and evaluation. Implemented in Python with PyTorch, it uses a decoupled, modular architecture to support rapid experimentation and fair model comparisons. It covers basic, hierarchical, dynamic, and cross-lingual topic modeling, and integrates a broad set of neural and conventional models, extensive datasets, and diverse evaluation metrics along with visualization tools. By providing tutorials, code examples, and a web-based visualization interface, TopMost aims to accelerate research and practical applications. The paper acknowledges limitations, such as the absence of perplexity-focused metrics and of support for prompting large language models.
Abstract
Topic models have a rich history with various applications and have recently been reinvigorated by neural topic modeling. However, these numerous topic models adopt totally distinct datasets, implementations, and evaluations. This impedes quick utilization and fair comparisons, and thereby hinders research progress and applications. To tackle this challenge, we propose in this paper a Topic Modeling System Toolkit (TopMost). Compared to existing toolkits, TopMost stands out by supporting more extensive features. It covers a broader spectrum of topic modeling scenarios with their complete lifecycles, including datasets, preprocessing, models, training, and evaluations. Thanks to its highly cohesive and decoupled modular design, TopMost enables rapid utilization, fair comparisons, and flexible extensions of diverse cutting-edge topic models. Our code, tutorials, and documentation are available at https://github.com/bobxwu/topmost.
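To make the idea of a decoupled lifecycle concrete, the following is a minimal, self-contained sketch in plain Python. It is not TopMost's API: every name here (`preprocess`, `ChunkFrequencyModel`, `topic_diversity`, `run_pipeline`) is hypothetical, and the "model" is a toy word-frequency placeholder rather than a real topic model. The sketch only illustrates how preprocessing, model, and evaluation can be swapped independently so that different models are scored by the same metric on the same data. The metric shown is topic diversity, commonly defined as the fraction of unique words among the top words of all topics.

```python
from collections import Counter
from typing import Callable, List

# NOTE: all names below are hypothetical illustrations, not TopMost's API.

Corpus = List[List[str]]   # tokenized documents
Topics = List[List[str]]   # top words per topic


def preprocess(docs: List[str],
               stopwords=frozenset({"the", "a", "of", "and"})) -> Corpus:
    """Lowercase, split on whitespace, and drop stopwords."""
    return [[w for w in doc.lower().split() if w not in stopwords]
            for doc in docs]


class ChunkFrequencyModel:
    """Toy placeholder for a topic model (an assumption for illustration).

    Documents are assigned to topics round-robin, and each topic is
    described by its most frequent words. Any object exposing `fit` and
    `get_top_words` could be used in its place.
    """

    def __init__(self, num_topics: int, num_top_words: int):
        self.num_topics = num_topics
        self.num_top_words = num_top_words
        self.topics_: Topics = []

    def fit(self, corpus: Corpus) -> "ChunkFrequencyModel":
        self.topics_ = []
        for k in range(self.num_topics):
            chunk = corpus[k::self.num_topics]  # round-robin assignment
            counts = Counter(w for doc in chunk for w in doc)
            top = [w for w, _ in counts.most_common(self.num_top_words)]
            self.topics_.append(top)
        return self

    def get_top_words(self) -> Topics:
        return self.topics_


def topic_diversity(topics: Topics) -> float:
    """Fraction of unique words among the top words of all topics."""
    all_words = [w for topic in topics for w in topic]
    return len(set(all_words)) / len(all_words) if all_words else 0.0


def run_pipeline(docs: List[str],
                 preprocess_fn: Callable[[List[str]], Corpus],
                 model,
                 metric_fn: Callable[[Topics], float]) -> float:
    """Run preprocessing, training, and evaluation as separate stages."""
    corpus = preprocess_fn(docs)
    model.fit(corpus)
    return metric_fn(model.get_top_words())


docs = [
    "the cat and the dog",
    "stock market of stocks",
    "a cat and a cat",
    "market stock market",
]
score = run_pipeline(docs, preprocess,
                     ChunkFrequencyModel(num_topics=2, num_top_words=2),
                     topic_diversity)
```

Because `run_pipeline` depends only on the `fit` and `get_top_words` interface, a different model or metric can be substituted without touching the other stages, which is the property a fair comparison relies on.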
