"I'm in the Bluesky Tonight": Insights from a Year Worth of Social Data

Andrea Failla, Giulio Rossetti

TL;DR

This dataset contains the complete post history of over 4M users (81% of all registered accounts), totalling 235M posts, and provides ground-truth data for studying the effects of content exposure and self-selection and performing content virality and diffusion analysis.

Abstract

Pollution of online social spaces caused by rampaging d/misinformation is a growing societal concern. However, recent decisions to reduce access to social media APIs are causing a shortage of publicly available, recent, social media data, thus hindering the advancement of computational social science as a whole. We present a large, high-coverage dataset of social interactions and user-generated content from Bluesky Social to address this pressing issue. The dataset contains the complete post history of over 4M users (81% of all registered accounts), totalling 235M posts. We also make available social data covering follow, comment, repost, and quote interactions. Since Bluesky allows users to create and bookmark feed generators (i.e., content recommendation algorithms), we also release the full output of several popular algorithms available on the platform, along with their timestamped "like" interactions and time of bookmarking. This dataset allows unprecedented analysis of online behavior and human-machine engagement patterns. Notably, it provides ground-truth data for studying the effects of content exposure and self-selection and performing content virality and diffusion analysis.

"I'm in the Bluesky Tonight": Insights from a Year Worth of Social Data

Paper Structure (13 sections, 1 equation, 6 figures, 3 tables)

Figures (6)

  • Figure 1: Screenshots of the home (left), feeds (middle), and profile (right) tabs from Bluesky's official iOS app (v1.71). In the home tab, the top row is a scrollable bar listing the user's bookmarked feeds. The post at the top only contains an image and received 12 comments, 167 reposts, and 1447 likes. The post at the bottom contains both text and an image. The feeds tab contains the list of bookmarked feed generators, along with a feed search bar. Finally, the profile tab shows the logged-in user's profile.
  • Figure 2: Most populated and active instances. Values are scaled logarithmically.
  • Figure 3: Posts per day (a) and posts per user (b) trends. In the latter plot, the red line represents the average value, the blue area represents the interquartile range of daily posts per user, and the blue line represents the mean computed over the interquartile range.
  • Figure 4: Cumulative Distribution Function of the inter-event time from users' first to last posts (days).
  • Figure 5: Temporal trends of English post sentiment.
  • ...and 1 more figure