AutoTM 2.0: Automatic Topic Modeling Framework for Documents Analysis

Maria Khodorchenko; Nikolay Butakov; Maxim Zuev; Denis Nasonov

TL;DR

This work presents AutoTM 2.0, a framework for optimizing additively regularized topic models, and shows that it outperforms the previous AutoTM on 5 datasets with different features and in two different languages.

Abstract

In this work, we present AutoTM 2.0, a framework for optimizing additively regularized topic models. Compared to the previous version, it adds a novel optimization pipeline, LLM-based quality metrics, and a distributed mode. AutoTM 2.0 is a convenient tool for specialists and non-specialists alike who work with text documents, whether to conduct exploratory data analysis or to perform clustering on an interpretable set of features. Quality evaluation is based on specially developed metrics, such as coherence and GPT-4-based approaches. Researchers and practitioners can easily integrate new optimization algorithms and adapt novel metrics to enhance modeling quality and extend their experiments. We show that AutoTM 2.0 achieves better performance than the previous AutoTM by providing results on 5 datasets with different features and in two different languages.

Paper Structure (14 sections, 6 figures, 1 table)

Introduction
Related work
Framework design
Dataset preprocessing
Optimization Approaches
Optimization pipelines
Quality estimation
Surrogate modeling
Distributed mode
Framework Performance
Datasets overview
Pipelines comparison
Quality metrics performance
Conclusion

Figures (6)

Figure 1: General design of the AutoTM 2.0 framework.
Figure 2: An example of a pipeline with 4 training stages.
Figure 3: Vectorization scheme for surrogate modeling in the graph-based approach.
Figure 4: Distributed mode schema.
Figure 5: Average fitness values with 90% confidence intervals for 5 datasets ((a) 20 Newsgroups, (b) Amazon food, (c) Banners, (d) Hotel reviews, (e) Lenta.ru) versus the number of iterations used with a surrogate model.
...and 1 more figure
