WikiReddit: Tracing Information and Attention Flows Between Online Platforms

Patrick Gildersleve, Anna Beers, Viviane Ito, Agustin Orozco, Francesca Tripodi

TL;DR

WikiReddit provides a multilingual, long-span dataset linking Reddit posts and comments to Wikipedia articles from 2020–2023, enriched with revision histories, page views, redirects, and Wikidata identifiers within a privacy-preserving SQLite3 schema. The work demonstrates cross-platform information flows, showing that Reddit mentions of Wikipedia modestly boost Wikipedia page views on posting days while edits show weaker signals, and highlights strong English-language dominance with meaningful cross-language linking. This resource enables researchers to study cross-platform attention, knowledge consumption, and multilingual information dynamics at scale, under open licensing and with FAIR-compliant data sharing. The dataset supports longitudinal analyses and offers a foundation for understanding how social discourse interacts with collaborative knowledge production across platforms.

Abstract

The World Wide Web is a complex, interconnected digital ecosystem in which information and attention flow between platforms and communities across the globe. These interactions co-construct how we understand the world, reflecting and shaping public discourse. Unfortunately, researchers often struggle to understand how information circulates and evolves across the web because platform-specific data is often siloed and restricted by linguistic barriers. To address this gap, we present a comprehensive, multilingual dataset capturing all Wikipedia mentions and links shared in posts and comments on Reddit from 2020 to 2023, excluding those from private and NSFW subreddits. Each linked Wikipedia article is enriched with revision history, page view data, article ID, redirects, and Wikidata identifiers. Through a research agreement with Reddit, our dataset ensures user privacy while providing a query and ID mechanism that integrates with the Reddit and Wikipedia APIs. This enables extended analyses for researchers studying how information flows across platforms. For example, Reddit discussions use Wikipedia for deliberation and fact-checking, which subsequently influences Wikipedia content by driving traffic to articles or inspiring edits. By analyzing the relationship between information shared and discussed on these platforms, our dataset provides a foundation for examining the interplay between social media discourse and collaborative knowledge consumption and production.
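To make the idea of "Wikipedia links shared in posts and comments" concrete, the following is a minimal sketch of how a link and its language subdomain might be pulled out of a piece of text. It is not the authors' extraction pipeline: the regular expression `WIKI_LINK` and the function `extract_wikipedia_links` are hypothetical names introduced for illustration, and the pattern is a simplification (for example, it stops at a closing parenthesis, so titles such as `Foo_(bar)` are truncated, and it does not cover plain-text mentions without a URL).

```python
import re
from urllib.parse import unquote

# Hypothetical, simplified pattern: matches desktop and mobile article URLs
# of the form https://<lang>.wikipedia.org/wiki/<title>.
WIKI_LINK = re.compile(
    r"https?://([a-z\-]+)\.(?:m\.)?wikipedia\.org/wiki/([^\s)\]#?]+)",
    re.IGNORECASE,
)

def extract_wikipedia_links(text):
    """Return (language, title) pairs for each Wikipedia article URL in text."""
    links = []
    for lang, raw_title in WIKI_LINK.findall(text):
        # Decode percent-encoding and turn underscores back into spaces.
        title = unquote(raw_title).replace("_", " ")
        links.append((lang.lower(), title))
    return links

example = (
    "See https://en.wikipedia.org/wiki/Alan_Turing and "
    "https://de.m.wikipedia.org/wiki/K%C3%B6ln for details"
)
print(extract_wikipedia_links(example))
```

Pairs like these are the kind of key that could then be matched against article IDs, redirects, and Wikidata identifiers, although how the dataset itself performs that resolution is not described in this section.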

Paper Structure

This paper contains 18 sections, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Histograms for the Reddit score of the posts and comments that mention Wikipedia (in text or as a link).
  • Figure 2: Plot showing the daily count of posts and comments that mention Wikipedia (in text or as a link) over 2020–2023.
  • Figure 3: Plot showing the daily average Reddit score of the posts and comments that mention Wikipedia (in text or as a link) over 2020–2023.
  • Figure 4: Figure showing the daily page views to Wikipedia articles on the day of posting and in the week after posting relative to the week before posting. A small number of points in the extremes of the distributions are cut for visual clarity.
  • Figure 5: Figure showing the proportion of links to each Wikipedia language subdomain for the most frequently occurring languages in the dataset.
  • ...and 1 more figure
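The comparison described for Figure 4 (page views on the day of posting and in the week after, relative to the week before) can be sketched as follows. The exact normalization used in the paper is not given in this section, so the sketch assumes the baseline is the mean of the seven days before posting; the function name `relative_page_views` and the dictionary input shape are hypothetical.

```python
from datetime import date, timedelta
from statistics import mean

def relative_page_views(daily_views, posting_day):
    """Compare posting-day and week-after views with the week before.

    daily_views: dict mapping date -> page view count (assumed input shape).
    Returns (posting_day_ratio, week_after_ratio), or None when any day in
    the 15-day window is missing or the baseline is zero.
    """
    before = [daily_views.get(posting_day - timedelta(days=d)) for d in range(1, 8)]
    after = [daily_views.get(posting_day + timedelta(days=d)) for d in range(1, 8)]
    on_day = daily_views.get(posting_day)
    if on_day is None or None in before or None in after:
        return None
    baseline = mean(before)  # assumed baseline: mean of the 7 days before
    if baseline == 0:
        return None
    return on_day / baseline, mean(after) / baseline

# Synthetic illustration only; these numbers are not from the dataset.
posted = date(2021, 6, 15)
views = {posted + timedelta(days=d): 100 for d in range(-7, 0)}
views[posted] = 150
views.update({posted + timedelta(days=d): 110 for d in range(1, 8)})
print(relative_page_views(views, posted))
```

A ratio above 1 on the posting day would correspond to the modest boost in page views mentioned in the TL;DR, under the baseline assumption stated above.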