Just Another Hour on TikTok: ID sampling to obtain a complete slice of TikTok
Benjamin Steel, Miriam Schirmer, Derek Ruths, Juergen Pfeffer
TL;DR
This work introduces ID-based sampling to obtain a near-complete slice of TikTok data, addressing the limitations of hashtag and API-based methods. By exploiting 64-bit post IDs whose higher bits encode creation time in a Snowflake-like fashion, the authors collect two representative datasets and estimate global post volume, engagement, and media characteristics, including AI-generated content and child-present content. The study reports a daily post volume of about 269.3 million, 18.00–24.62% of videos featuring children, and 0.5% AI-generated content, along with comprehensive error analyses and global usage patterns. The work provides public data releases and code to calibrate and extend TikTok research, offering valuable priors for researchers and policymakers studying platform dynamics, safety, and misinformation.
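To make the ID structure concrete, the following is a minimal sketch of how a Snowflake-like ID can be mapped to a creation time, and how a time window can be mapped to a range of candidate IDs. It assumes that the upper 32 bits of the 64-bit ID hold a Unix timestamp in seconds; that bit layout, along with the names `TIMESTAMP_SHIFT`, `id_to_datetime`, and `id_range_for_window`, is an illustrative assumption and is not taken from the summary above. The layout of the lower bits is not modeled.

```python
from datetime import datetime, timezone
from typing import Tuple

# Assumption for illustration: the upper 32 bits of the 64-bit post ID
# store the creation time as Unix seconds. The lower 32 bits are treated
# as opaque (sequence, machine, or other fields).
TIMESTAMP_SHIFT = 32


def id_to_datetime(post_id: int) -> datetime:
    """Recover the (assumed) creation time encoded in a post ID."""
    seconds = post_id >> TIMESTAMP_SHIFT
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


def id_range_for_window(start: datetime, end: datetime) -> Tuple[int, int]:
    """Return the inclusive ID range covering the window [start, end)."""
    lowest = int(start.timestamp()) << TIMESTAMP_SHIFT
    highest = (int(end.timestamp()) << TIMESTAMP_SHIFT) - 1
    return lowest, highest
```

Under this assumption, every post created in a given minute falls inside one contiguous ID range, which is what makes sampling by ID rather than by hashtag or search query possible. Enumerating the full range is impractical, so a real collector would also need to know which lower-bit patterns actually occur.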
Abstract
TikTok is now a massive platform with a deep impact on global events. Despite preliminary studies, fundamental characteristics of the platform remain undetermined. We develop a method to extract a representative sample of >99% of posts from a given time range on TikTok, and use it to collect all posts from a full hour on the platform, alongside all posts from a single minute from each hour of a day. Through this, we obtain post metadata, video media, and comments from a close-to-complete slice of TikTok, and report the critical statistics of the platform. Notably, we estimate that a total of 269 million posts were produced on the day we studied, that 18% of videos on the platform feature children, and that at least 0.5% of posts contain artificial intelligence-generated content.
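The one-minute-per-hour design lends itself to a simple extrapolation of daily volume. The sketch below shows one plausible way to do that arithmetic: scale each sampled minute to its hour and sum over the day. This is an assumed estimator for illustration only; the function name `estimate_daily_posts` is hypothetical, and the authors' actual estimate may include corrections (for example for missed posts) that are not described in this section.

```python
from typing import Sequence

MINUTES_PER_HOUR = 60
HOURS_PER_DAY = 24


def estimate_daily_posts(minute_counts: Sequence[int]) -> int:
    """Extrapolate a daily total from one sampled minute per hour.

    Each count is treated as representative of every minute in its hour,
    which is an assumption of this sketch, not a claim about the paper.
    """
    if len(minute_counts) != HOURS_PER_DAY:
        raise ValueError("expected exactly one minute count per hour of the day")
    return sum(count * MINUTES_PER_HOUR for count in minute_counts)
```

Keeping the per-hour counts separate, rather than averaging them first, preserves the daily usage curve and lets the same data feed both the total and the hour-by-hour analysis.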
