
What we can learn from TikTok through its Research API

Francesco Corso, Francesco Pierri, Gianmarco De Francisci Morales

TL;DR

This study evaluates the reliability and utility of TikTok's official Research API by constructing a random, monthly-stratified sample of over 500k videos spanning 2018–2023. It analyzes API quotas, data availability, temporal patterns, regional distribution, and engagement metrics. The number of videos actually returned falls substantially short of the theoretical quota, and the data for 2018 contain notable gaps. The results show a strong regional skew toward Asia (with India leading) and a measurable engagement uplift for videos that use viral hashtags, while the prevalence of conspiracy hashtags appears limited. The findings offer practical guidance for researchers using the API, highlight biases and data-quality concerns that affect API-based inference, and underline the need for improved transparency.

Abstract

TikTok is a social media platform that has gained immense popularity over the last few years, particularly among younger demographics, due to the viral trends and challenges shared worldwide. The recent release of a free Research API opens the door to collecting data on posted videos, associated comments, and user activities. Our study evaluates the reliability of the results returned by the Research API by collecting and analyzing a random sample of TikTok videos posted over a span of six years. Our preliminary results are instrumental for future research that aims to study the platform, highlighting caveats on the geographical distribution of videos and on the global prevalence of viral and conspiratorial hashtags.
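
This summary does not spell out the exact sampling procedure, so the following is only a minimal sketch of one way a monthly-stratified random sample could be set up: draw the same number of uniformly random timestamps from every calendar month of the study period, then use each timestamp as the start of a short query window. The function names, the `per_month` parameter, and the fixed seed are illustrative assumptions, not the authors' code, and no API call is made here.

```python
import random
from datetime import datetime, timedelta, timezone


def month_starts(first_year, last_year):
    """UTC start of every month in [first_year, last_year], plus one closing bound."""
    starts = [
        datetime(year, month, 1, tzinfo=timezone.utc)
        for year in range(first_year, last_year + 1)
        for month in range(1, 13)
    ]
    # Closing bound: the first instant after the last month of the period.
    starts.append(datetime(last_year + 1, 1, 1, tzinfo=timezone.utc))
    return starts


def sample_timestamps(first_year, last_year, per_month, seed=0):
    """Draw `per_month` uniformly random timestamps inside each calendar month.

    Hypothetical helper: each month is one stratum, so every month
    contributes the same number of query points regardless of its length.
    """
    rng = random.Random(seed)
    bounds = month_starts(first_year, last_year)
    sample = {}
    for start, end in zip(bounds, bounds[1:]):
        span_seconds = int((end - start).total_seconds())
        sample[(start.year, start.month)] = sorted(
            start + timedelta(seconds=rng.randrange(span_seconds))
            for _ in range(per_month)
        )
    return sample


# Six years (2018 through 2023) give 72 monthly strata.
strata = sample_timestamps(2018, 2023, per_month=5)
```

Stratifying by month keeps early, low-volume periods represented in the sample, which is what makes year-over-year comparisons of availability and engagement possible.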

Paper Structure (10 sections, 8 figures, 1 table)


Figures (8)

  • Figure 1: Time series of the data collection. The blue line represents the theoretical quota (maximum number of videos obtainable with the given number of API calls), while the histogram shows the obtained quota per month.
  • Figure 2: Number of videos posted (a) for each day of the month, (b) for each day of the week, (c) for each hour of the day (UTC), and (d) for each minute of the hour. The API shows a bias at the daily level, but not at the minute level.
  • Figure 3: CCDFs of the four main interactions on TikTok: number of views, of likes, of shares, and of comments for videos per year. All the features have a heavy-tailed distribution. The yearly platform growth is evident in the shift to the right of each feature. Axes are on a logarithmic scale.
  • Figure 4: Top 10 regions by prevalence in the dataset, with each region's percentage share of the sample. India remains the largest region historically, despite the ban in 2020.
  • Figure 5: Yearly prevalence of the top 10 regions in our sample. The light-grey area represents all the other regions collected. Most countries in the top 10 are in Asia.
  • ...and 3 more figures
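
The CCDFs mentioned for Figure 3 are complementary cumulative distribution functions: for each value x, the fraction of videos whose count is at least x. As a minimal sketch of how such a curve can be computed from a list of counts (the `ccdf` helper and the `views` values are made up for illustration and are not taken from the paper's data):

```python
from collections import Counter


def ccdf(values):
    """Return sorted (x, P(X >= x)) pairs for the empirical CCDF of `values`."""
    total = len(values)
    counts = Counter(values)
    points = []
    remaining = total  # number of observations that are >= the current x
    for x in sorted(counts):
        points.append((x, remaining / total))
        remaining -= counts[x]
    return points


# Illustrative, made-up view counts for five videos.
views = [0, 3, 3, 10, 250]
curve = ccdf(views)
```

When such a curve is drawn on logarithmic axes, as in Figure 3, zero-valued counts cannot be plotted directly and would need to be dropped or shifted. A heavy tail then shows up as a slow, roughly straight decay on the log-log plot.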