Just Another Hour on TikTok: ID sampling to obtain a complete slice of TikTok

Benjamin Steel, Miriam Schirmer, Derek Ruths, Juergen Pfeffer

TL;DR

This work introduces ID-based sampling to obtain a near-complete slice of TikTok data, addressing the limitations of hashtag and API-based methods. By exploiting 64-bit post IDs whose higher bits encode creation time in a Snowflake-like fashion, the authors collect two representative datasets and estimate global post volume, engagement, and media characteristics, including AI-generated content and child-present content. The study reports a daily post volume of about 269.3 million, 18.00% to 24.62% of videos featuring children, and 0.5% AI-generated content, along with comprehensive error analyses and global usage patterns. The work provides public data releases and code to calibrate and extend TikTok research, offering valuable priors for researchers and policymakers studying platform dynamics, safety, and misinformation.
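To make the idea of a Snowflake-like ID concrete, here is a minimal Python sketch of how a creation time could be read from, and written into, the high bits of a 64-bit ID. The exact bit layout is an assumption made for illustration: the sketch supposes that the upper 32 bits hold a Unix timestamp in seconds, which this summary does not state. The names `TIMESTAMP_SHIFT`, `id_to_datetime`, and `id_bounds_for_second` are hypothetical and are not taken from the authors' code.

```python
from datetime import datetime, timezone

# Assumption for illustration: the upper 32 bits of the 64-bit ID
# hold the creation time as Unix seconds; the lower 32 bits hold
# everything else (sequence numbers, machine bits, and so on).
TIMESTAMP_SHIFT = 32


def id_to_datetime(post_id: int) -> datetime:
    """Recover the creation time encoded in the high bits of an ID."""
    return datetime.fromtimestamp(post_id >> TIMESTAMP_SHIFT, tz=timezone.utc)


def id_bounds_for_second(moment: datetime) -> tuple[int, int]:
    """Return the smallest and largest ID that map to the given second."""
    seconds = int(moment.timestamp())
    low = seconds << TIMESTAMP_SHIFT
    high = low | ((1 << TIMESTAMP_SHIFT) - 1)
    return low, high


# The first second of the hour shown in Figure 1.
start = datetime(2024, 4, 10, 17, 0, 0, tzinfo=timezone.utc)
low, high = id_bounds_for_second(start)
```

Under this assumed layout, every ID between `low` and `high` decodes to the same second, so a time window corresponds to a contiguous ID range that a sampler can enumerate or probe.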

Abstract

TikTok is now a massive platform, and has a deep impact on global events. Despite preliminary studies, issues remain in determining fundamental characteristics of the platform. We develop a method to extract a representative sample of >99% of posts from a given time range on TikTok, and use it to collect all posts from a full hour on the platform, alongside all posts from a single minute from each hour of a day. Through this, we obtain post metadata, video media, and comments from a close-to-complete slice of TikTok, and report the critical statistics of the platform. Notably, we estimate a total of 269 million posts produced on the day we looked at, that 18% of videos on the platform feature children, and that at least 0.5% of posts contain artificial intelligence-generated content.
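The abstract describes sampling one minute from each hour of a day and using it to estimate a daily total. One plausible way to scale such samples is sketched below in Python. This is not the authors' estimator: it assumes that each sampled minute is representative of its hour and that collection is complete, and the function name `estimate_daily_posts` and the example counts are hypothetical.

```python
def estimate_daily_posts(posts_per_sampled_minute: list[int]) -> int:
    """Scale one sampled minute per hour up to a daily total.

    Assumes each sampled minute is representative of its hour,
    so the hourly volume is 60 times the sampled minute's count.
    """
    if len(posts_per_sampled_minute) != 24:
        raise ValueError("expected one sampled minute for each of 24 hours")
    return sum(count * 60 for count in posts_per_sampled_minute)


# Hypothetical counts, chosen only to show the arithmetic.
example_counts = [100] * 24
example_total = estimate_daily_posts(example_counts)
```

A fuller treatment would also correct for posts missed by the sampler and report an error interval, which the paper's error analysis addresses.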

Paper Structure

This paper contains 23 sections, 13 figures, 4 tables.

Figures (13)

  • Figure 1: TikTok posts per second and per minute on the 10th of April, 2024, 5pm--6pm UTC, as well as the estimated posts per hour over a 24-hour period.
  • Figure 2: Histograms of various statistics in the post metadata, across the 1 hour dataset. We speculate that the peaks we see at round numbers are due to interval view count tracking optimization. However, they could also be evidence of inauthentic engagement.
  • Figure 3: Choropleth of posting per capita for all countries, determined via the locationCreated tag available on the post metadata. We normalize by each country's population, using figures that count residents rather than citizens (un2024world), to show posts per capita. The counts are corrected for geographical regions that we estimate we under-represent. Areas in grey are countries where we had fewer than 50 posts available.
  • Figure 4: Regression of country population against post count in this dataset, with the 6 countries furthest from the trend line labelled, as determined by highest and lowest residuals from the linear regression. Red points correspond to the country name closest to them.
  • Figure 5: Share of posts tagged as AI-generated content. Areas in grey are countries where we had fewer than 50 posts available.
  • ...and 8 more figures