Merging Embedded Topics with Optimal Transport for Online Topic Modeling on Data Streams
Federica Granese, Benjamin Navet, Serena Villata, Charles Bouveyron
TL;DR
The paper addresses online topic modeling on continuous text streams, where evolving topics must be tracked in real time and shifts must be detected as they occur. It introduces StreamETM, an online extension of the Embedded Topic Model that uses unbalanced optimal transport to merge existing topics and discover new ones across consecutive batches, and adds an online change point detection component. The approach preserves word and topic embeddings, uses a cosine-based transport cost, and updates topic representations with a memory parameter; the code is publicly available. Empirical results on the 20kNewsGroup dataset show that StreamETM outperforms competing online methods in topic coherence and diversity and detects change points more reliably, which suggests practical potential for monitoring evolving text streams.
Abstract
Topic modeling is a key component in unsupervised learning, employed to identify topics within a corpus of textual data. The rapid growth of social media generates an ever-growing volume of textual data daily, making online topic modeling methods essential for managing data streams that arrive continuously over time. This paper introduces a novel approach to online topic modeling named StreamETM. The approach builds on the Embedded Topic Model (ETM) to handle data streams by merging models learned on consecutive partial document batches using unbalanced optimal transport. Additionally, an online change point detection algorithm is employed to identify shifts in topics over time, revealing significant changes in the dynamics of text streams. Numerical experiments on simulated and real-world data show that StreamETM outperforms competing methods. The code is publicly available at https://github.com/fgranese/StreamETM.
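
To make the merging step concrete, the following is a minimal NumPy sketch of how topic embeddings from two consecutive batches could be matched with unbalanced optimal transport under a cosine cost. It is an illustration under stated assumptions, not the StreamETM implementation: the function names, the entropic Sinkhorn-style solver with KL-relaxed marginals, the uniform topic weights, the `min_mass_ratio` decision rule, and the interpretation of `memory` as the weight kept on the previous topic embedding are all hypothetical choices made for this example.

```python
import numpy as np

def cosine_cost(old_topics, new_topics):
    """Cost matrix C[i, j] = 1 - cos(old_i, new_j)."""
    a = old_topics / np.linalg.norm(old_topics, axis=1, keepdims=True)
    b = new_topics / np.linalg.norm(new_topics, axis=1, keepdims=True)
    return 1.0 - a @ b.T

def unbalanced_sinkhorn(a, b, cost, eps=0.05, rho=0.1, n_iter=500):
    """Entropic unbalanced OT with KL-relaxed marginals (Sinkhorn-style scaling).

    eps is the entropic regularization, rho the marginal relaxation strength.
    Both default values are illustrative assumptions.
    """
    kernel = np.exp(-cost / eps)
    u = np.ones_like(a)
    v = np.ones_like(b)
    power = rho / (rho + eps)
    for _ in range(n_iter):
        u = (a / (kernel @ v)) ** power
        v = (b / (kernel.T @ u)) ** power
    return u[:, None] * kernel * v[None, :]

def merge_topics(old_topics, new_topics, memory=0.5, min_mass_ratio=0.5):
    """Merge new topic embeddings into old ones, or keep them as new topics.

    Assumed rule (hypothetical): a new topic that receives at least
    min_mass_ratio of its weight from the transport plan is merged into the
    old topic sending it the most mass; otherwise it is appended as a newly
    discovered topic.
    """
    old_topics = np.asarray(old_topics, dtype=float)
    new_topics = np.asarray(new_topics, dtype=float)
    a = np.full(len(old_topics), 1.0 / len(old_topics))  # uniform weights (assumption)
    b = np.full(len(new_topics), 1.0 / len(new_topics))
    plan = unbalanced_sinkhorn(a, b, cosine_cost(old_topics, new_topics))

    merged = old_topics.copy()
    discovered = []
    for j in range(len(new_topics)):
        received = plan[:, j].sum()
        if received >= min_mass_ratio * b[j]:
            i = int(np.argmax(plan[:, j]))
            # Convex update: memory weights the existing topic embedding.
            merged[i] = memory * merged[i] + (1.0 - memory) * new_topics[j]
        else:
            discovered.append(new_topics[j])
    if discovered:
        merged = np.vstack([merged, np.array(discovered)])
    return merged, plan
```

Because the marginals are only softly constrained, a new topic that is far from every existing topic in cosine distance receives almost no transported mass, which is what allows it to be flagged as new rather than forced into a match. For example, calling `merge_topics([[1, 0], [0, 1]], [[2, 0], [-1, -1]])` merges the first new topic into the first old one and appends the second as a third topic.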
