TGDataset: Collecting and Exploring the Largest Telegram Channels Dataset
Massimo La Morgia, Alessandro Mei, Alberto Maria Mongardini
TL;DR
The paper introduces TGDataset, the largest publicly available collection of Telegram channels (≈120k channels, ≈498M messages), to support the study of Telegram's ecosystem at scale. It describes a snowball data-collection pipeline that starts from seed channels and follows forwarding relationships to expand coverage, and it releases the data under FAIR principles together with analysis scripts. Using language detection and LDA topic modeling, it characterizes the dominant languages (notably Russian and Farsi) and the topics of English-language channels, which include carding and extremist content. Network analysis additionally identifies a cluster of channels belonging to the Sabmyk conspiracy network. The resource supports research on misinformation diffusion, conspiracy networks, and platform dynamics on Telegram, and can inform work on moderation.
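The snowball expansion summarized above can be illustrated with a minimal sketch. It is not the authors' collection code: the function `snowball`, the callback `get_forward_sources`, the `max_channels` cap, and the toy `FORWARDS` mapping are all hypothetical names introduced here, and the breadth-first visiting order is an assumption. In a real crawl, the callback would query Telegram for the channels whose messages a given channel has forwarded.

```python
from collections import deque


def snowball(seeds, get_forward_sources, max_channels=None):
    """Breadth-first expansion from seed channels.

    get_forward_sources(channel) is a hypothetical callback that returns
    the channels whose messages `channel` has forwarded.
    """
    visited = set()
    queue = deque()
    for seed in seeds:
        if seed not in visited:
            visited.add(seed)
            queue.append(seed)

    collected = []
    while queue:
        # Optional cap on how many channels are collected (assumption).
        if max_channels is not None and len(collected) >= max_channels:
            break
        channel = queue.popleft()
        collected.append(channel)
        for source in get_forward_sources(channel):
            if source not in visited:
                visited.add(source)
                queue.append(source)
    return collected


# Toy forwarding relationships standing in for real Telegram data.
FORWARDS = {
    "seed_a": ["chan_b", "chan_c"],
    "chan_b": ["chan_c", "chan_d"],
    "chan_c": ["seed_a"],
    "chan_d": [],
}

channels = snowball(["seed_a"], lambda c: FORWARDS.get(c, []))
print(channels)
```

Tracking a `visited` set keeps the crawl from looping when channels forward each other's messages, as `seed_a` and `chan_c` do in the toy data.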
Abstract
Telegram is one of the most popular instant messaging apps in today's digital age. In addition to providing a private messaging service, Telegram, through its channels, is an effective medium for rapidly broadcasting content to a large audience (e.g., COVID-19 announcements) but, unfortunately, also for disseminating radical ideologies and coordinating attacks (e.g., the Capitol Hill riot). This paper presents the TGDataset, a new dataset that includes 120,979 Telegram channels and over 400 million messages, making it, to the best of our knowledge, the largest collection of Telegram channels. After a brief introduction to the data collection process, we analyze the languages used within our dataset and the topics covered by English-language channels. Finally, we discuss some use cases in which our dataset can help to better understand the Telegram ecosystem and to study the diffusion of questionable news. In addition to the raw dataset, we release the scripts we used to analyze it and the list of channels belonging to the network of a new conspiracy theory called Sabmyk.
