Discord Unveiled: A Comprehensive Dataset of Public Communication (2015-2024)
Yan Aquino, Pedro Bento, Arthur Buzelin, Lucas Dayrell, Samira Malaquias, Caio Santana, Victoria Estanislau, Pedro Dutenhefner, Guilherme H. G. Evangelista, Luisa G. Porfírio, Caio Souza Grossi, Pedro B. Rigueira, Virgilio Almeida, Gisele L. Pappa, Wagner Meira
TL;DR
Discord Unveiled introduces the largest publicly available dataset of Discord public-server communication (2015–2024), comprising over 2.05 billion messages from roughly 4.74 million users across 3,167 servers. The dataset is collected via Discord's public API, organized per server, and anonymized (pseudonyms for users, hashed IDs, removal of sensitive fields) to enable scalable computational social science analyses while preserving privacy. It supports the study of decentralized moderation, governance, information diffusion, and multilingual linguistic patterns, with preliminary insights into bot usage and language distribution. By providing a FAIR (findable, accessible, interoperable, reusable), richly described resource with clear ethical safeguards, the work lays a foundation for analyzing online communities, making cross-platform comparisons, and assessing the impact of user-driven moderation on digital social dynamics.
Abstract
Discord has evolved from a gaming-focused communication tool into a versatile platform supporting diverse online communities. Despite its large user base and active public servers, academic research on Discord remains limited due to data accessibility challenges. This paper introduces Discord Unveiled: A Comprehensive Dataset of Public Communication (2015-2024), the most extensive dataset of Discord public-server data to date. The dataset comprises over 2.05 billion messages from 4.74 million users across 3,167 public servers, representing approximately 10% of servers listed in Discord's Discovery feature. Spanning from Discord's launch in 2015 to the end of 2024, it offers a robust temporal and thematic framework for analyzing decentralized moderation, community governance, information dissemination, and social dynamics. Data was collected through Discord's public API, adhering to ethical guidelines and privacy standards via anonymization techniques. Organized into structured JSON files, the dataset facilitates seamless integration with computational social science methodologies. Preliminary analyses reveal significant trends in user engagement, bot utilization, and linguistic diversity, with English predominating alongside substantial representations of Spanish, French, and Portuguese. Additionally, prevalent community themes such as social, art, music, and memes highlight Discord's expansion beyond its gaming origins.
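As an illustration of the anonymization steps summarized above (pseudonyms for users, hashed IDs, removal of sensitive fields), the following is a minimal sketch and not the authors' actual pipeline. The field names (`author_id`, `author_name`, `email`, `avatar`), the choice of a keyed HMAC-SHA256 hash, and the `user_` pseudonym format are all assumptions made for the example; the dataset's real schema and hashing scheme may differ.

```python
import hashlib
import hmac

# Hypothetical set of fields treated as sensitive; the real schema may differ.
SENSITIVE_FIELDS = {"email", "avatar"}


def pseudonymize_id(raw_id: str, secret_key: bytes) -> str:
    """Map a raw ID to a stable hex digest using a keyed hash (HMAC-SHA256)."""
    return hmac.new(secret_key, raw_id.encode("utf-8"), hashlib.sha256).hexdigest()


def anonymize_message(message: dict, secret_key: bytes) -> dict:
    """Return a copy of `message` with a hashed author ID, a pseudonymous
    author name, and the assumed sensitive fields removed."""
    # Drop sensitive fields; the input dict itself is left unmodified.
    cleaned = {k: v for k, v in message.items() if k not in SENSITIVE_FIELDS}
    hashed = pseudonymize_id(str(message["author_id"]), secret_key)
    cleaned["author_id"] = hashed
    # Derive the pseudonym from the hash so it stays consistent per user.
    cleaned["author_name"] = "user_" + hashed[:8]
    return cleaned


if __name__ == "__main__":
    key = b"example-secret-key"  # assumption: a secret kept private by the curators
    raw = {
        "author_id": "123456789",
        "author_name": "alice",
        "email": "alice@example.com",
        "content": "hello world",
    }
    print(anonymize_message(raw, key))
```

Using a keyed hash rather than a plain hash means the same user maps to the same pseudonym across servers, while the mapping cannot be rebuilt by hashing known IDs without the key. This is one common design choice, not necessarily the one used for this dataset.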
