Fully Bayesian Approaches to Topics over Time

Julián Cendrero, Julio Gonzalo, Ivar Zapata

TL;DR

The paper addresses instability in the Topics over Time (ToT) model by reformulating it as a fully Bayesian model (BToT) that places a conjugate prior on the Beta distributions describing each topic's timestamps. Because a document contributes a single timestamp but many words, the two modalities differ in scale during inference; Weighted BToT (WBToT) rebalances them by repeating the timestamp observation several times per document via an auxiliary topic assignment. Using variational inference in batch and online modes, the authors show on two datasets (over 200 years of SOTU addresses and a COVID-19 Twitter corpus of 10 million tweets) that WBToT captures events better than LDA and BERTopic and is more coherent than BToT. They also show that its online optimization is stable, which makes WBToT applicable to problems that are intractable for standard ToT. The work highlights the practical potential of fully Bayesian, time-aware topic models for event-centric analysis of long-running, large-scale timestamped corpora.

Abstract

The Topics over Time (ToT) model captures thematic changes in timestamped datasets by explicitly modeling publication dates jointly with word co-occurrence patterns. However, ToT was not approached in a fully Bayesian fashion, a flaw that makes it susceptible to stability problems. To address this issue, we propose a fully Bayesian Topics over Time (BToT) model via the introduction of a conjugate prior to the Beta distribution. This prior acts as a regularization that prevents the online version of the algorithm from unstable updates when a topic is poorly represented in a mini-batch. The characteristics of this prior to the Beta distribution are studied here for the first time. Still, this model suffers from a difference in scale between the single-time observations and the multiplicity of words per document. A variation of BToT, Weighted Bayesian Topics over Time (WBToT), is proposed as a solution. In WBToT, publication dates are repeated a certain number of times per document, which balances the relative influence of words and timestamps along the inference process. We have tested our models on two datasets: a collection of over 200 years of US state-of-the-union (SOTU) addresses and a large-scale COVID-19 Twitter corpus of 10 million tweets. The results show that WBToT captures events better than Latent Dirichlet Allocation and other SOTA topic models like BERTopic: the median absolute deviation of the topic presence over time is reduced by $51\%$ and $34\%$, respectively. Our experiments also demonstrate the superior coherence of WBToT over BToT, which highlights the importance of balancing the time and word modalities. Finally, we illustrate the stability of the online optimization algorithm in WBToT, which allows the application of WBToT to problems that are intractable for standard ToT.
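The regularizing effect of a conjugate prior on Beta parameters can be illustrated with a minimal sketch. It does not reproduce the paper's parametrization, which is not given here; it assumes the generic exponential-family construction, in which the prior on $(a, b)$ is proportional to $\exp(a\chi_1 + b\chi_2 - \nu \log B(a, b))$ and observing timestamps $t_i \in (0, 1)$ adds $\sum_i \log t_i$ to $\chi_1$, $\sum_i \log(1 - t_i)$ to $\chi_2$, and the number of observations to $\nu$. The function names, the hyperparameter names `chi1`, `chi2`, `nu`, and the choice of prior values are all hypothetical.

```python
import math


def beta_log_partition(a, b):
    """log B(a, b), the log-normalizer of a Beta(a, b) density."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def posterior_hyperparams(chi1, chi2, nu, timestamps):
    """Conjugate update with timestamps normalized to the open interval (0, 1)."""
    s1 = sum(math.log(t) for t in timestamps)
    s2 = sum(math.log(1.0 - t) for t in timestamps)
    return chi1 + s1, chi2 + s2, nu + len(timestamps)


def log_density_unnormalized(a, b, chi1, chi2, nu):
    """log p(a, b | chi1, chi2, nu) up to an additive constant."""
    return a * chi1 + b * chi2 - nu * beta_log_partition(a, b)


# A mini-batch in which the topic owns a single timestamp, t = 0.5.
batch = [0.5]

# No prior (all hyperparameters zero): the objective is the bare log-likelihood.
no_prior = posterior_hyperparams(0.0, 0.0, 0.0, batch)

# Hypothetical weak prior (an assumption, not the paper's setting): one
# pseudo-observation carrying the expected sufficient statistics of a
# Uniform(0, 1), for which E[log t] = E[log(1 - t)] = -1.
with_prior = posterior_hyperparams(-1.0, -1.0, 1.0, batch)

# Compare both objectives along the diagonal a = b.
for a in (10.0, 100.0, 1000.0):
    print(a,
          log_density_unnormalized(a, a, *no_prior),
          log_density_unnormalized(a, a, *with_prior))
```

Along the diagonal $a = b$, the objective without a prior keeps growing as the parameters increase, because the Beta density can collapse onto the single observed timestamp. With the pseudo-observation added, the objective decreases for large parameters, so the estimate stays bounded. This is the kind of unstable update for a poorly represented topic that the abstract describes the prior as preventing.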

Paper Structure

This paper contains 33 sections, 65 equations, 12 figures, 5 tables, 7 algorithms.

Figures (12)

  • Figure 1: Bayesian Topics over Time (BToT)
  • Figure 2: Weighted Bayesian Topics over Time (WBToT)
  • Figure 4: Presence over time and 20 top words for the “The Great Inflation” topic.
  • Figure 5: Presence over time and 20 top words for the “Health care reform” topic.
  • Figure 6: Presence over time and 20 top words for the “American Indians and conflicts with Mexico” topic.
  • ...and 7 more figures