Determination of the Number of Topics Intrinsically: Is It Possible?
Victor Bulatov, Vasiliy Alekseev, Konstantin Vorontsov
TL;DR
The paper addresses the challenge of selecting the number of topics $T$ in topic models, arguing that intrinsic metrics do not reliably reflect corpus-intrinsic properties. It systematically evaluates a wide range of intrinsic quality metrics (perplexity, stability, diversity, clustering, information-theoretic criteria, entropy, lift, and top-tokens coherence) across multiple topic models and corpora, using held-out data and subsampling to assess robustness. The findings show that most intrinsic criteria are inconsistent and highly model-dependent; only relatively simple measures such as AIC, BIC, MDL, and Rényi entropy offer somewhat more stable guidance, and even these fail to yield a single universal optimal $T$. The authors conclude that $T$ should be treated as a hyperparameter and urge the development of more robust approaches (e.g., model architectures resilient to the choice of $T$, hierarchical or semi-supervised methods, or alternative strategies) rather than a continued search for an intrinsic, corpus-specific topic count.
Abstract
The number of topics might be the most important parameter of a topic model. The topic modelling community has developed a variety of procedures to estimate the number of topics in a dataset, but there has not yet been a sufficiently complete comparison of existing practices. This study attempts to partially fill this gap by investigating the performance of various methods applied to several topic models on a number of publicly available corpora. Our analysis demonstrates that intrinsic methods are far from being reliable and accurate tools. The number of topics is shown to be a method- and model-dependent quantity, as opposed to an absolute property of a particular corpus. We conclude that other methods for dealing with this problem should be developed and suggest some promising directions for further research.
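
To make the method-dependence concrete, the following is a minimal sketch of how two of the information-theoretic criteria mentioned above (AIC and BIC) could be used to pick $T$ from a sweep of fitted models. It is not the authors' procedure. The function names, the parameter-counting rule (only topic-word parameters are counted), and the log-likelihood values are all assumptions invented for illustration; none of the numbers come from the paper.

```python
import math

def aic(log_likelihood, n_params):
    # Standard definition: AIC = 2k - 2 ln L
    return 2 * n_params - 2 * log_likelihood

def bic(log_likelihood, n_params, n_obs):
    # Standard definition: BIC = k ln n - 2 ln L
    return n_params * math.log(n_obs) - 2 * log_likelihood

def n_free_params(n_topics, vocab_size):
    # Assumption: count only the topic-word distributions, each with
    # (vocab_size - 1) free parameters. Other counting rules are possible
    # and would change the scores.
    return n_topics * (vocab_size - 1)

def select_num_topics(log_likelihoods, vocab_size, n_tokens):
    """Return the T minimizing AIC, the T minimizing BIC, and all scores.

    log_likelihoods maps each candidate T to the log-likelihood of a model
    fitted with that T (hypothetical input, produced by any topic model).
    """
    scores = {}
    for t, ll in log_likelihoods.items():
        k = n_free_params(t, vocab_size)
        scores[t] = {"aic": aic(ll, k), "bic": bic(ll, k, n_tokens)}
    best_aic = min(scores, key=lambda t: scores[t]["aic"])
    best_bic = min(scores, key=lambda t: scores[t]["bic"])
    return best_aic, best_bic, scores

# Toy log-likelihoods, made up for illustration only (not from the paper).
toy_ll = {5: -700_000.0, 10: -690_000.0, 20: -684_000.0, 40: -682_000.0}
best_aic, best_bic, scores = select_num_topics(
    toy_ll, vocab_size=1_000, n_tokens=100_000
)
print("T chosen by AIC:", best_aic)
print("T chosen by BIC:", best_bic)
```

On these invented numbers the two criteria disagree, because BIC penalizes each additional parameter more heavily than AIC whenever the number of observations is large. This is a toy instance of the kind of disagreement between methods that the abstract describes.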
