GET-Tok: A GenAI-Enriched Multimodal TikTok Dataset Documenting the 2022 Attempted Coup in Peru

Gabriela Pinto, Keith Burghardt, Kristina Lerman, Emilio Ferrara

TL;DR

GET-Tok introduces a Generative AI-Enriched TikTok data pipeline that augments the TikTok Research API with Whisper transcripts, GPT-4-based video descriptions, and a multimodal stance detector to enable robust multimodal analysis of non-English political discourse. The Peru case study yields 43,697 videos with augmented transcripts and descriptions, illustrating scalable integration of audio, visual, and textual signals for social science research. The approach highlights both the benefits of AI-assisted data enrichment and challenges related to data availability, model biases, and ethical considerations, while offering public data and code to foster reproducibility. By enabling multilingual, multimodal analysis of online narratives and their offline manifestations, the work provides a valuable resource for studying digital platforms' role in political processes.

Abstract

TikTok is one of the largest and fastest-growing social media sites in the world. However, some TikTok features, such as voice transcripts, are often missing, and other important features, such as OCR or video descriptions, do not exist. We introduce the Generative AI Enriched TikTok (GET-Tok) data, a pipeline for collecting TikTok videos and enriched data by augmenting the TikTok Research API with generative AI models. As a case study, we collect videos about the attempted coup in Peru initiated by its former President, Pedro Castillo, and the accompanying protests. The data includes information on 43,697 videos published from November 20, 2022 to March 1, 2023 (102 days). Generative AI augments the collected data with transcripts of the TikTok videos, text descriptions of what is shown in the videos, the text displayed within the videos, and the stances expressed in the videos. Overall, this pipeline will contribute to a better understanding of online discussion in a multimodal setting with applications of Generative AI, and it demonstrates the utility of such enrichment for non-English-language social media. The code used to produce the pipeline is in a public GitHub repository: https://github.com/gabbypinto/GET-Tok-Peru.
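To make the collection window and the enrichment idea concrete, here is a minimal Python sketch. It is not the authors' code. It assumes that the collection is issued as consecutive date-bounded queries of at most 30 days each (the `max_days` value is an assumption about the API's limit), and that a generated transcript is used only when the platform supplies none. The names `date_windows`, `EnrichedVideo`, and all of its fields are hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional, Tuple


def date_windows(start: date, end: date, max_days: int = 30) -> List[Tuple[date, date]]:
    """Split the inclusive range [start, end] into consecutive windows
    of at most max_days days each (max_days is an assumed query limit)."""
    windows = []
    cursor = start
    while cursor <= end:
        window_end = min(cursor + timedelta(days=max_days - 1), end)
        windows.append((cursor, window_end))
        cursor = window_end + timedelta(days=1)
    return windows


@dataclass
class EnrichedVideo:
    """Hypothetical record for one video plus its generated enrichments."""
    video_id: str
    api_transcript: Optional[str] = None      # transcript supplied by the platform, if any
    whisper_transcript: Optional[str] = None  # transcript generated from the audio track
    description: Optional[str] = None         # generated description of the visuals
    on_screen_text: Optional[str] = None      # text displayed within the video
    stance: Optional[str] = None              # stance label from the stance detector

    def transcript(self) -> Optional[str]:
        # Assumed policy: prefer the platform transcript, else fall back to the generated one.
        return self.api_transcript or self.whisper_transcript


# Collection period stated in the abstract.
START = date(2022, 11, 20)
END = date(2023, 3, 1)

total_days = (END - START).days + 1  # inclusive day count
windows = date_windows(START, END)

if __name__ == "__main__":
    print(total_days)
    for window_start, window_end in windows:
        # YYYYMMDD is an assumed date format for the query parameters.
        print(window_start.strftime("%Y%m%d"), window_end.strftime("%Y%m%d"))
```

Counting both endpoints, the stated period spans 102 days, which this sketch splits into four windows under the 30-day assumption. The fallback in `transcript()` is one simple way to combine the two transcript sources; the paper's own merging rule may differ.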

Paper Structure

This paper contains 10 sections, 8 figures, and 2 tables.

Figures (8)

  • Figure 1: Top 10 Most Frequent and Viewed Hashtags
  • Figure 2: Timeline of events and volume of TikTok posts.
  • Figure 3: Share of video transcripts extracted by TikTok and Whisper.
  • Figure 4: Transcripts from TikTok and Whisper with yellow for capitalization, red for punctuation, blue for spelling discrepancies.
  • Figure : (A) "Vamos Pedro castillo el pueblo está contigo #parati #viralvideo #tiktok #peruanadas #love #pedrocastillopresidente2021". Translation: "Let's go Pedro Castillo, the people are with you". Translation (screen): "Let's go Pedro Castillo".
  • ...and 3 more figures