Interactive Topic Models with Optimal Transport

Garima Dhanania, Sheshera Mysore, Chau Minh Pham, Mohit Iyyer, Hamed Zamani, Andrew McCallum

TL;DR

EdTM addresses the need for analyst-guided, label-name supervised topic modeling by formulating document-to-label assignment as a global optimal transport problem, using LM/LLM-based affinities to define costs. It supports flexible supervision forms (topic names, descriptions, or seed documents) and partial assignments, enabling robust, interactive exploration even with noisy inputs. The method computes entropy-regularized transport plans $\mathcal{W}$ and leverages batched Sinkhorn/Partial-OT to scale to large corpora, producing coherent topic allocations. Empirically, EdTM achieves high-quality topics and robust performance across diverse datasets, often outperforming clustering and certain LDA baselines and approaching large-LM baselines while providing analyst-driven control. This approach enables practical, interpretable, and scalable interactive topic modeling with direct analyst influence over topic definitions.
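To make the entropy-regularized transport step concrete, here is a minimal sketch of plain Sinkhorn iterations over a small document-topic cost matrix. It is an illustration under stated assumptions, not the authors' implementation: the affinity values are made up, the cost is assumed to be one minus the affinity, both marginals are assumed uniform, and the helper name `sinkhorn_plan` and the parameters `reg` and `n_iters` are hypothetical. Batching and Partial-OT are omitted.

```python
import math


def sinkhorn_plan(cost, row_mass, col_mass, reg=0.1, n_iters=200):
    """Entropy-regularized optimal transport via Sinkhorn iterations.

    cost[i][j] is the cost of assigning document i to topic j.
    row_mass holds the mass per document, col_mass the mass per topic;
    both are assumed to sum to 1.
    """
    n, k = len(cost), len(cost[0])
    # Gibbs kernel K = exp(-cost / reg)
    kernel = [[math.exp(-cost[i][j] / reg) for j in range(k)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * k
    for _ in range(n_iters):
        # Rescale rows so each document ships its full mass
        u = [row_mass[i] / sum(kernel[i][j] * v[j] for j in range(k))
             for i in range(n)]
        # Rescale columns so each topic receives its full mass
        v = [col_mass[j] / sum(kernel[i][j] * u[i] for i in range(n))
             for j in range(k)]
    # Transport plan: diag(u) @ K @ diag(v)
    return [[u[i] * kernel[i][j] * v[j] for j in range(k)] for i in range(n)]


# Hypothetical affinities in [0, 1] for 4 documents and 2 topics
# (in practice these would come from an LM/LLM scorer).
affinity = [
    [0.9, 0.2],
    [0.8, 0.3],
    [0.1, 0.7],
    [0.6, 0.5],
]
# Assumption: cost is one minus affinity.
cost = [[1.0 - a for a in row] for row in affinity]
n_docs, n_topics = len(cost), len(cost[0])

# Assumption: uniform mass over documents and over topics.
plan = sinkhorn_plan(
    cost,
    [1.0 / n_docs] * n_docs,
    [1.0 / n_topics] * n_topics,
)

# Hard assignment: each document goes to the topic holding most of its mass.
assignments = [max(range(n_topics), key=lambda j: row[j]) for row in plan]
print(assignments)
```

Under these assumptions the last document, whose affinities are nearly tied and slightly favor the first topic, is assigned to the second topic, because the uniform topic marginal forces each topic to absorb half of the total mass. This is the sense in which the assignment is made globally rather than per document.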

Abstract

Topic models are widely used to analyze document collections. While they are valuable for discovering latent topics when analysts are unfamiliar with a corpus, analysts also commonly start with an understanding of its content. This understanding may come from categories obtained in an initial pass over the corpus, or from a desire to analyze the corpus through a predefined set of categories derived from a high-level theoretical framework (e.g., political ideology). In these scenarios, analysts want a topic modeling approach that incorporates their understanding of the corpus while supporting various forms of interaction with the model. In this work, we present EdTM, an approach for label-name supervised topic modeling. EdTM frames topic modeling as an assignment problem, leveraging LM/LLM-based document-topic affinities and using optimal transport to make globally coherent topic assignments. In experiments, we show the efficacy of our framework compared to few-shot LLM classifiers and to topic models based on clustering and LDA. Further, we show EdTM's ability to incorporate various forms of analyst feedback while remaining robust to noisy analyst inputs.

Paper Structure (16 sections, 1 equation, 1 figure, 5 tables, 1 algorithm)

Figures (1)

  • Figure 1: Interactive topic modeling with EdTM consists of two steps: document-topic scoring for analyst-provided topic names using LM/LLM bi-encoders and cross-encoders, followed by computation of partial or complete topic assignments using optimal transport. Analyst topic names may take various forms, such as label names, descriptions, or documents, to support rich forms of interaction.