Iterative Improvement of an Additively Regularized Topic Model

Alex Gorbulev, Vasiliy Alekseev, Konstantin Vorontsov

TL;DR

The paper addresses instability and incomplete coverage in topic modelling by introducing ITAR, an iteratively updated additively regularized topic model. ITAR trains a sequence of related models where each step fixes previously discovered good topics and decorrelates or filters out bad ones through two regularizers, yielding monotonic improvement in good-topic coverage. Empirical results show ITAR achieves the highest proportion of good topics and robust topic diversity across multiple datasets, with moderate perplexity relative to baselines like LDA and BERTopic. The approach offers a deterministic, provable path to better topic models and provides open-source code for replication.

Abstract

Topic modelling is fundamentally a soft clustering problem: known objects (documents) are assigned to unknown clusters (topics). The task is therefore ill-posed. In particular, topic models are unstable and incomplete. As a result, finding a good topic model (through repeated hyperparameter selection, model training, and topic quality assessment) can be particularly long and labor-intensive. We aim to simplify this process and make it more deterministic and provable. To this end, we present a method for iterative training of a topic model. The essence of the method is that a series of related topic models is trained so that each subsequent model is at least as good as the previous one, i.e., it retains all the good topics found earlier. The connection between the models is achieved by additive regularization. The result of this iterative training is the last topic model in the series, which we call the iteratively updated additively regularized topic model (ITAR). Experiments conducted on several collections of natural language texts show that the proposed ITAR model performs better than other popular topic models (LDA, ARTM, BERTopic), that its topics are diverse, and that its perplexity (the ability to "explain" the underlying data) is moderate.
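The iterative scheme described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: `train_model` is a hypothetical stand-in that copies the fixed topics verbatim (where the paper uses regularization) and fills the free slots with random distributions, and `is_good` is an assumed quality test based on how much probability mass falls on the top five words. The sketch only shows why the number of good topics cannot decrease from one iteration to the next.

```python
import random


def train_model(num_topics, vocab_size, fixed_topics, rng):
    """Hypothetical stand-in for regularized training.

    Previously collected good topics are kept unchanged; the remaining
    slots are filled with freshly generated random word distributions.
    """
    free_slots = num_topics - len(fixed_topics)
    new_topics = []
    for _ in range(free_slots):
        weights = [rng.random() ** 4 for _ in range(vocab_size)]
        total = sum(weights)
        new_topics.append(tuple(w / total for w in weights))
    return list(fixed_topics) + new_topics


def is_good(topic, threshold=0.4):
    """Assumed quality test: the top-5 words carry enough probability mass."""
    return sum(sorted(topic, reverse=True)[:5]) >= threshold


def iterate(num_topics=10, vocab_size=50, iterations=8, seed=0):
    """Train a series of models and record the good-topic count per iteration."""
    rng = random.Random(seed)
    good = []      # good topics collected so far (T_+)
    history = []
    for _ in range(iterations):
        model = train_model(num_topics, vocab_size, good, rng)
        # Fixed topics pass the same deterministic test again,
        # so the collected set can only grow.
        good = [topic for topic in model if is_good(topic)]
        history.append(len(good))
    return history


if __name__ == "__main__":
    print(iterate())
```

Because the collected topics are carried over unchanged and the quality test is deterministic, the recorded counts form a non-decreasing sequence bounded by `num_topics`. In a real setting the fixed topics are held in place only approximately by a regularizer, so this guarantee depends on the regularization being strong enough.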

Paper Structure (25 sections, 18 equations, 3 figures, 5 tables)

Figures (3)

  • Figure 1: The idea of an iterative approach to topic model improvement. The topics of the initial model $M_0$ are automatically or semi-automatically classified into good $T_+$, bad $T_-$, and "unremarkable" $T_0$ (those that are not bad but that one can afford to lose because they are not relevant for the study; for example, they can be duplicates of topics from $T_+$). Next, a new topic model $M_1$ is trained so that it retains all the topics of $T_+$ and at the same time has no topics from $T_-$. Thus, model $M_1$ is at least as good as model $M_0$ in terms of the number of good topics $T_+'$, and possibly even better: $T_+' \supseteq T_+$.
  • Figure 2: Percentage of good model topics depending on iteration ($\uparrow$). In iterative models (TopicBank2, ITAR, ITAR2), each subsequent model is trained based on the previous one, hence the monotonic dependence (in contrast to non-iterative models).
  • Figure 3: Percentage of good topics in the model as a function of iteration ($\uparrow$). In contrast to the results shown in Fig. 2, the goodness of a topic was determined by the value of its intra-text coherence rather than by the coherence of top-word co-occurrences. Since the intra-text coherence scores of different topics are not independent, it is more difficult in this case for the iterative model to accumulate good topics. (This can be seen in the graph for the ITAR2 model, where more than half of the iterations passed without adding any new topics. Moreover, the TopicBank2 model can perform better than ITAR2: in TopicBank the models at different iterations are trained independently of each other, so the collected good topics do not influence the quality assessment of new topics, whereas ITAR2 also applies a pairwise correlation check against the collected good topics, which further narrows the search area for new topics. The graph for ITAR stops before reaching the maximum iteration because so many good topics were accumulated that fixing them by regularization led to the degeneration of the remaining free topics.)
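
The effect described in the Figure 3 caption, where an extra pairwise correlation check against the collected good topics narrows the search area for new ones, can be sketched as follows. This is an illustrative assumption rather than the paper's procedure: the function `select_new_good_topics`, its parameters, and the use of cosine similarity as a stand-in for the correlation measure are all hypothetical.

```python
import math


def cosine(p, q):
    """Cosine similarity between two word distributions (assumed correlation proxy)."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0


def select_new_good_topics(candidates, scores, collected, min_score,
                           max_similarity=None):
    """Pick candidate topics to add to the collected good set.

    A candidate must reach `min_score` (e.g. a coherence value). If
    `max_similarity` is given, it must also not be too similar to any
    topic that has already been collected.
    """
    accepted = []
    for topic, score in zip(candidates, scores):
        if score < min_score:
            continue  # not good enough on its own
        if max_similarity is not None and any(
            cosine(topic, other) > max_similarity for other in collected
        ):
            continue  # near-duplicate of an already collected topic
        accepted.append(topic)
    return accepted


collected = [(0.7, 0.2, 0.1, 0.0)]
candidates = [(0.6, 0.3, 0.1, 0.0), (0.0, 0.1, 0.2, 0.7), (0.25, 0.25, 0.25, 0.25)]
scores = [0.9, 0.8, 0.1]

# Quality check only: two candidates pass.
without_filter = select_new_good_topics(candidates, scores, collected, 0.5)
# Quality check plus similarity cap: the near-duplicate is rejected.
with_filter = select_new_good_topics(candidates, scores, collected, 0.5, 0.8)
```

With the similarity cap enabled, a candidate that scores well but closely resembles a collected topic is discarded, so fewer topics are added per iteration. This is consistent with the caption's observation that the stricter variant accumulates good topics more slowly.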