iDRAMA-Scored-2024: A Dataset of the Scored Social Media Platform from 2020 to 2023

Jay Patel, Pujan Paudel, Emiliano De Cristofaro, Gianluca Stringhini, Jeremy Blackburn

TL;DR

The paper examines how banned Reddit communities migrate to alternative platforms and how these migrations shape discourse and real-world events. It introduces iDRAMA-Scored-2024, a large-scale dataset from the Scored platform spanning 2020–2023, containing roughly 57 million posts across hundreds of communities, plus 48 million sentence embeddings generated with the INSTRUCTOR model. The authors release the dataset following FAIR principles on Zenodo and HuggingFace, along with the embeddings and a data-access toolkit, so that researchers can study platform migration, radicalization patterns, and information flows without heavy data collection. They also characterize posting activity, migrant communities, and the web-link ecosystem, offering insight into how fringe communities interact with broader web content and with major political events such as the 2020 U.S. election and the Capitol riot, with implications for understanding misinformation, conspiracies, and online harassment dynamics.

Abstract

Online web communities often face bans for violating platform policies, encouraging their migration to alternative platforms. This migration, however, can result in increased toxicity and unforeseen consequences on the new platform. In recent years, researchers have collected data from many alternative platforms, indicating coordinated efforts leading to offline events, conspiracy movements, hate speech propagation, and harassment. Thus, it becomes crucial to characterize and understand these alternative platforms. To advance research in this direction, we collect and release a large-scale dataset from Scored, an alternative Reddit platform that sheltered banned fringe communities such as c/TheDonald (a prominent right-wing community) and c/GreatAwakening (a conspiratorial community). Over four years, we collected approximately 57M posts from Scored, with at least 58 communities identified as migrating from Reddit and over 950 communities created since the platform's inception. Furthermore, we provide sentence embeddings of all posts in our dataset, generated through a state-of-the-art model, to further advance the field in characterizing the discussions within these communities. We aim to provide these resources so that researchers can pursue such investigations without the need for extensive data collection and processing efforts.

Paper Structure

This paper contains 17 sections, 8 figures, 4 tables.

Figures (8)

  • Figure 1: Visual appearance of c/TheDonald's home page on Scored platform.
  • Figure 2: Evolution of Scored.
  • Figure 3: Temporal evolution of daily activity in our dataset: (a) daily number of submissions/comments; and (b) daily number of submissions categorized in different types. X-axis shows daily ticks with a 3-month time interval and Y-axis shows the frequency.
  • Figure 4: Temporal evolution of daily activity (submissions + comments) in the top 15 communities. X-axis shows daily ticks and Y-axis shows the normalized number of posts (each community's daily count divided by that community's maximum daily count).
  • Figure 5: CDF of number of submissions per user.
  • ...and 3 more figures
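
The per-community normalization described for Figure 4 (dividing each community's daily post count by that community's maximum) can be sketched as follows. This is a minimal illustration and not the authors' code: the function name `normalize_per_community`, the input layout, and the toy counts are all assumptions made up for the example.

```python
from collections import defaultdict


def normalize_per_community(daily_counts):
    """Scale each community's daily counts by that community's peak day.

    daily_counts: dict mapping (community, day) -> number of posts.
    Returns: dict mapping community -> {day: count / max count for it}.
    The input layout is hypothetical, chosen only for this sketch.
    """
    # Group the flat (community, day) counts into one series per community.
    by_community = defaultdict(dict)
    for (community, day), count in daily_counts.items():
        by_community[community][day] = count

    normalized = {}
    for community, series in by_community.items():
        peak = max(series.values())
        # Guard against an all-zero series to avoid division by zero.
        normalized[community] = {
            day: (count / peak if peak else 0.0)
            for day, count in series.items()
        }
    return normalized


# Toy counts, invented for illustration (not taken from the dataset).
example = {
    ("c/TheDonald", "2021-01-05"): 50,
    ("c/TheDonald", "2021-01-06"): 200,
    ("c/GreatAwakening", "2021-01-05"): 10,
    ("c/GreatAwakening", "2021-01-06"): 40,
}
result = normalize_per_community(example)
```

After this scaling every community peaks at 1.0, so communities of very different sizes can be compared on a single axis by the shape of their activity rather than by raw volume.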