Topic Modelling Black Box Optimization

Roman Akramov, Artem Khamatullin, Svetlana Glazyrina, Maksim Kryzhanovskiy, Roman Ischenko

TL;DR

The paper treats the choice of the number of topics $T$, the critical hyperparameter of LDA topic models, as a discrete black-box optimization problem in which each candidate $T$ is scored by perplexity on a validation set. It compares four optimizers under a fixed evaluation budget: two hand-designed evolutionary methods (GA and ES) and two learned, amortized methods (PABBO and SABBO). All four eventually reach a similar band of final perplexity, but the amortized approaches are substantially more sample- and time-efficient: SABBO often finds a near-optimal $T$ after a single evaluation, and PABBO reaches competitive values within a few evaluations. These findings offer practical guidance for tuning topic counts efficiently on large corpora and motivate future work on supervised and reinforcement-learning approaches to automating hyperparameter selection in topic modeling. The study contributes an empirical comparison across four corpora and demonstrates the value of sharpness-aware, amortized optimization in discrete black-box settings for NLP models.

Abstract

Choosing the number of topics $T$ in Latent Dirichlet Allocation (LDA) is a key design decision that strongly affects both the statistical fit and the interpretability of topic models. In this work, we formulate the selection of $T$ as a discrete black-box optimization problem, where each function evaluation corresponds to training an LDA model and measuring its validation perplexity. Under a fixed evaluation budget, we compare four families of optimizers: two hand-designed evolutionary methods, Genetic Algorithm (GA) and Evolution Strategy (ES), and two learned, amortized approaches, Preferential Amortized Black-Box Optimization (PABBO) and Sharpness-Aware Black-Box Optimization (SABBO). Our experiments show that, while GA, ES, PABBO, and SABBO eventually reach a similar band of final perplexity, the amortized optimizers are substantially more sample- and time-efficient. SABBO typically identifies a near-optimal topic number after essentially a single evaluation, and PABBO finds competitive configurations within a few evaluations, whereas GA and ES require almost the full budget to approach the same region.
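
To make the black-box formulation concrete, the following minimal sketch searches over integer values of $T$ under a fixed evaluation budget. It is an illustrative local search written for this summary, not the GA, ES, PABBO, or SABBO procedure evaluated in the paper. The names `budgeted_topic_search`, `toy_perplexity`, `step`, and the range of $T$ are assumptions, and the toy objective is a synthetic stand-in: in a real run, `evaluate` would train an LDA model with the given number of topics and return its validation perplexity.

```python
import random
from typing import Callable, Dict, Tuple


def budgeted_topic_search(
    evaluate: Callable[[int], float],
    t_min: int,
    t_max: int,
    budget: int,
    step: int = 10,
    seed: int = 0,
) -> Tuple[int, float, Dict[int, float]]:
    """Minimise evaluate(T) over integers in [t_min, t_max].

    Uses at most `budget` calls to `evaluate` (assumes budget >= 1).
    Hypothetical helper for illustration; not the paper's optimizer.
    """
    rng = random.Random(seed)
    # Never ask for more evaluations than there are distinct candidates.
    budget = min(budget, t_max - t_min + 1)
    history: Dict[int, float] = {}

    # Start from a random topic count; this costs one evaluation.
    best_t = rng.randint(t_min, t_max)
    best_val = evaluate(best_t)
    history[best_t] = best_val

    while len(history) < budget:
        # Propose an unevaluated T near the incumbent.
        lo, hi = max(t_min, best_t - step), min(t_max, best_t + step)
        pool = [t for t in range(lo, hi + 1) if t not in history]
        if not pool:
            # Neighbourhood exhausted: fall back to any unevaluated T.
            pool = [t for t in range(t_min, t_max + 1) if t not in history]
        cand = rng.choice(pool)
        val = evaluate(cand)  # the expensive step: one LDA fit in practice
        history[cand] = val
        if val < best_val:
            best_t, best_val = cand, val

    return best_t, best_val, history


def toy_perplexity(num_topics: int) -> float:
    """Synthetic stand-in for validation perplexity (assumed shape only)."""
    return 1000.0 + (num_topics - 40) ** 2


if __name__ == "__main__":
    t, ppl, hist = budgeted_topic_search(toy_perplexity, 2, 100, budget=15)
    print(f"best T = {t}, objective = {ppl:.1f}, evaluations = {len(hist)}")
```

Because every call to `evaluate` stands for a full LDA training run, the sketch never evaluates the same $T$ twice and counts the budget in distinct evaluations, which is the quantity the sample-efficiency comparison in the abstract refers to.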

Paper Structure

This paper contains 42 sections, 21 equations, 3 figures, 1 table, 5 algorithms.

Figures (3)

  • Figure 1: Problem Statement Schema
  • Figure 2: Best validation perplexity as a function of the number of LDA evaluations for the four corpora.
  • Figure 3: Best validation perplexity as a function of cumulative wall-clock time spent on LDA training and evaluation.