An Incremental Clustering Baseline for Event Detection on Twitter

Marjolaine Ray, Qi Wang, Frédérique Mélanie-Becquet, Thierry Poibeau, Béatrice Mazoyer

TL;DR

The paper tackles the challenge of detecting events in Twitter streams under high volume and linguistic variability. It introduces an incremental clustering baseline that combines mini-batch First Story Detection with Sentence Transformer embeddings to produce semantically informed event clusters without assuming a fixed number of events. Across English Event2012 and French Event2018 datasets, the approach outperforms a recent graph-based baseline and a tf-idf baseline, while offering superior time efficiency and lower memory usage, with a time complexity of $O(n w)$ and favorable batch-size trade-offs. The method provides a practical, scalable baseline for future event-detection systems and highlights the value of modern embeddings in streaming short-text analysis.
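As a rough illustration of the approach summarised above, the following is a minimal sketch of a mini-batch First Story Detection loop, not the authors' implementation. It assumes the embeddings have already been computed (for example with a Sentence Transformer model) and are non-zero vectors. The function name `mini_batch_fsd` and the `window` default are hypothetical; the defaults `threshold=0.55` and `batch_size=8` mirror the $t$ and $b$ values quoted in the Figure 2 caption. Each document is compared with the documents preceding it inside a sliding window, and it opens a new cluster when its nearest neighbour is farther than the threshold in cosine distance.

```python
import numpy as np


def mini_batch_fsd(embeddings, threshold=0.55, batch_size=8, window=1000):
    """Label each row of `embeddings` (in stream order) with a cluster id.

    Hypothetical sketch: `window` is an illustrative value, and zero
    vectors are assumed not to occur.
    """
    x = np.asarray(embeddings, dtype=float)
    # Normalise rows so that a dot product equals cosine similarity.
    x = x / np.linalg.norm(x, axis=1, keepdims=True)

    n = len(x)
    labels = np.full(n, -1, dtype=int)
    next_label = 0

    for start in range(0, n, batch_size):
        stop = min(start + batch_size, n)
        lo = max(0, start - window)  # oldest document still in the window

        # One matrix product per batch: cosine distances from every batch
        # document to the window documents and to the batch itself.
        dist = 1.0 - x[start:stop] @ x[lo:stop].T

        for row in range(stop - start):
            i = start + row
            # Keep only documents that arrived strictly before document i.
            candidates = dist[row, : i - lo]
            if candidates.size == 0 or candidates.min() > threshold:
                # No sufficiently close neighbour: document i starts a new event.
                labels[i] = next_label
                next_label += 1
            else:
                # Otherwise inherit the cluster of the nearest earlier document.
                labels[i] = labels[lo + int(candidates.argmin())]

    return labels


if __name__ == "__main__":
    toy = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [1.0, 0.05]]
    print(mini_batch_fsd(toy, threshold=0.55, batch_size=2, window=10))
```

In this sketch each document is compared with at most `window + batch_size - 1` earlier documents, so the total work grows linearly with the number of documents for a fixed window, which is in line with the $O(n w)$ figure if $w$ is read as the window size. The number of clusters is never fixed in advance; it grows whenever a document has no close neighbour in the window.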

Abstract

Event detection in text streams is a crucial task for the analysis of online media and social networks. One of the current challenges in this field is establishing a performance standard while maintaining an acceptable level of computational complexity. In our study, we use an incremental clustering algorithm combined with recent advancements in sentence embeddings. Our objective is to compare our findings with previous studies, specifically those by Cao et al. (2024) and Mazoyer et al. (2020). Our results demonstrate significant improvements and could serve as a relevant baseline for future research in this area.

Paper Structure

This paper contains 22 sections, 2 figures, 1 table.

Figures (2)

  • Figure 1: Evolution of execution time and adjusted mutual information (AMI) of the "mini-batch" FSD algorithm depending on batch size $b$ on the entire Event2012 corpus (68,841 documents).
  • Figure 2: ARI and AMI scores with different SBERT models and different clustering algorithms. All FSD tests ran with $b=8$ and $t=0.55$.