An Iterative Approach to Topic Modelling
Albert Wong, Florence Wing Yau Cheng, Ashley Keung, Yamileth Hercules, Mary Alexandra Garcia, Yew-Wei Lim, Lien Pham
TL;DR
This paper tackles the challenge of evaluating and refining topic models beyond a single pass by proposing an iterative BERTopic-based workflow. The method progressively prunes topics and re-runs the model, guided by clustering-validity metrics such as the Adjusted Rand Index ($ARI$) and $NVI$, stopping when improvements fall below a threshold. Using a COVIDSenti-A tweet subset, the study demonstrates a reduction from 52 to 36 topics over four iterations, with $ARI$ nearing 1 and stability metrics near 0, suggesting near-final topic completeness. The work highlights practical benefits for refining topic sets and discusses extending the approach to other algorithms and datasets with or without ground-truth topics, pointing to meaningful future research directions.
Abstract
Topic modelling has become increasingly popular for summarizing text data, such as social media posts and articles. However, topic modelling is usually completed in one shot, and assessing the quality of the resulting topics is challenging. No effective methods or measures have been developed for assessing the results or for making further enhancements to the topics. In this research, we propose an iterative process for topic modelling that gives rise to a sense of completeness of the resulting topics when the process is complete. Using the BERTopic package, a popular method in topic modelling, we demonstrate how the modelling process can be applied iteratively to arrive at a set of topics that cannot be further improved upon, using one of three selected measures for clustering comparison as the decision criterion. This demonstration is conducted using a subset of the COVIDSenti-A dataset. The early success leads us to believe that further research using this approach in conjunction with other topic modelling algorithms could be viable.
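As a rough illustration of the iterative idea, the sketch below re-runs a topic-assignment step until consecutive assignments agree closely, with the Adjusted Rand Index as the decision criterion. It is a minimal sketch under stated assumptions, not the procedure used in the paper: `fit_topics` is a hypothetical stand-in for one modelling pass (for example, a BERTopic run followed by pruning), and comparing consecutive iterations against a fixed `threshold` is our assumption about how the stopping rule could be applied. The ARI is computed directly from pair counts so that the example needs only the standard library.

```python
from collections import Counter
from math import comb
from typing import Callable, List, Optional, Sequence, Tuple


def adjusted_rand_index(a: Sequence[int], b: Sequence[int]) -> float:
    """Adjusted Rand Index between two labelings of the same documents."""
    if len(a) != len(b):
        raise ValueError("labelings must have the same length")
    total_pairs = comb(len(a), 2)
    # Pairs of documents placed together in both labelings.
    together_both = sum(comb(c, 2) for c in Counter(zip(a, b)).values())
    together_a = sum(comb(c, 2) for c in Counter(a).values())
    together_b = sum(comb(c, 2) for c in Counter(b).values())
    expected = together_a * together_b / total_pairs if total_pairs else 0.0
    max_index = (together_a + together_b) / 2
    if max_index == expected:
        # Degenerate case (e.g. both labelings use a single cluster).
        return 1.0
    return (together_both - expected) / (max_index - expected)


# Hypothetical signature: (documents, previous labels or None) -> new labels.
FitTopics = Callable[[Sequence[str], Optional[List[int]]], List[int]]


def iterate_until_stable(
    docs: Sequence[str],
    fit_topics: FitTopics,
    threshold: float = 0.99,  # assumed cut-off, not taken from the paper
    max_iter: int = 10,
) -> Tuple[List[int], List[float]]:
    """Re-run the modelling pass until consecutive labelings agree."""
    labels = fit_topics(docs, None)
    history: List[float] = []
    for _ in range(max_iter):
        new_labels = fit_topics(docs, labels)
        ari = adjusted_rand_index(labels, new_labels)
        history.append(ari)
        labels = new_labels
        if ari >= threshold:
            break
    return labels, history
```

In practice `fit_topics` would wrap the actual modelling and pruning step, and `history` records how quickly the topic assignments settle from one iteration to the next.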
