Merging Embedded Topics with Optimal Transport for Online Topic Modeling on Data Streams

Federica Granese, Benjamin Navet, Serena Villata, Charles Bouveyron

TL;DR

The paper tackles online topic modeling on continuous text streams, where topics must be tracked in real time as they evolve and shifts must be detected. It introduces StreamETM, an online extension of the Embedded Topic Model that uses unbalanced optimal transport to merge and discover topics across consecutive batches, and adds an online change point detection component. The approach preserves word and topic embeddings, employs a cosine-based transport cost, and updates topic representations with a memory parameter; the code is public for reproducibility. Empirical results on the 20 Newsgroups dataset show that StreamETM outperforms competing online methods in topic coherence and diversity and detects change points more reliably, which suggests practical potential for monitoring evolving text streams.

Abstract

Topic modeling is a key component in unsupervised learning, employed to identify topics within a corpus of textual data. The rapid growth of social media generates an ever-growing volume of textual data daily, making online topic modeling methods essential for managing these data streams that continuously arrive over time. This paper introduces a novel approach to online topic modeling named StreamETM. This approach builds on the Embedded Topic Model (ETM) to handle data streams by merging models learned on consecutive partial document batches using unbalanced optimal transport. Additionally, an online change point detection algorithm is employed to identify shifts in topics over time, enabling the identification of significant changes in the dynamics of text streams. Numerical experiments on simulated and real-world data show StreamETM outperforming competitors. The code is publicly available at https://github.com/fgranese/StreamETM.
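
The merging step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the generic entropic unbalanced Sinkhorn iterations, the matching threshold `min_ratio`, the convex-combination update with `memory`, and all function names and parameter values below are assumptions chosen for illustration. The repository linked above contains the actual method.

```python
import numpy as np

def cosine_cost(A, B):
    """Pairwise cosine distance (1 - cosine similarity) between rows of A and B."""
    A_n = A / np.linalg.norm(A, axis=1, keepdims=True)
    B_n = B / np.linalg.norm(B, axis=1, keepdims=True)
    return 1.0 - A_n @ B_n.T

def unbalanced_sinkhorn(a, b, C, eps=0.05, rho=0.1, n_iter=500):
    """Generic entropic unbalanced OT with KL-relaxed marginals.

    eps is the entropic regularization, rho the marginal relaxation strength.
    Both values are illustrative assumptions, not taken from the paper.
    """
    K = np.exp(-C / eps)
    u = np.ones_like(a)
    v = np.ones_like(b)
    power = rho / (rho + eps)
    for _ in range(n_iter):
        u = (a / (K @ v)) ** power
        v = (b / (K.T @ u)) ** power
    return u[:, None] * K * v[None, :]

def merge_topics(prev, new, plan, b, memory=0.5, min_ratio=0.5):
    """Hypothetical merge rule: absorb a new topic into its best match if it
    received enough transported mass, otherwise keep it as a new topic."""
    merged = [row.copy() for row in prev]
    for j, topic in enumerate(new):
        received = plan[:, j].sum() / b[j]
        if received >= min_ratio:
            i = int(np.argmax(plan[:, j]))
            # Convex combination controlled by the memory parameter (assumed form).
            merged[i] = memory * merged[i] + (1.0 - memory) * topic
        else:
            merged.append(topic.copy())
    return np.vstack(merged)

# Toy example: two previous topics, two topics from the current batch.
prev = np.array([[1.0, 0.0], [0.0, 1.0]])
new = np.array([[0.9, 0.1], [-1.0, -0.2]])
a = np.full(len(prev), 1.0 / len(prev))  # uniform topic weights (assumption)
b = np.full(len(new), 1.0 / len(new))

plan = unbalanced_sinkhorn(a, b, cosine_cost(prev, new))
merged = merge_topics(prev, new, plan, b)
print(merged)
```

In this toy setting the first new topic is close to an existing one and is absorbed into it, while the second receives almost no transported mass and is appended as a newly discovered topic. Relaxing the marginals is what allows mass to be left untransported; a balanced formulation would force every new topic onto an existing one.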

Paper Structure

This paper contains 34 sections, 7 equations, 12 figures, 5 tables.

Figures (12)

  • Figure 1: Topic embeddings in a Euclidean space. Left: the setting without perturbation. Center and right: a perturbation is added to the dark blue diamond at position $(1.01,0.45)$. Dark blue diamonds represent topic embeddings at time $t-1$, while light blue markers indicate topic embeddings at time $t$ before merging. The merged embeddings obtained via unbalanced optimal transport (UOT) are shown as '$\times$', whereas those obtained using ED are shown as '$+$'. Dashed lines connect topics matched by UOT, while dot-dashed lines indicate associations based on ED.
  • Figure 2: The left panel shows the transport map, while the right panel depicts the cosine similarity map. Darker cells indicate higher transported mass (left) or shorter cosine distance (right).
  • Figure 3: Qualitative assessment of topic evolution over time in the Custom setting. Blue vertical lines indicate the change points detected by the algorithm.
  • Figure 4: Custom setting. The most frequent word for each topic across the 15 training runs.
  • Figure 5: Custom setting. (a) Harmonic mean of TC and TD; (b) ROC curves. Results were computed across the 15 training runs.
  • ...and 7 more figures