Leveraging GPT for the Generation of Multi-Platform Social Media Datasets for Research

Henry Tari, Danial Khan, Justus Rutten, Darian Othman, Rishabh Kaushal, Thales Bertaglia, Adriana Iamnitchi

TL;DR

This work tackles the challenge of obtaining authentic, multi-platform social media data by examining whether GPT-based generation can yield high-fidelity synthetic datasets across six major platforms. The authors compare platform-aware and platform-agnostic prompting strategies, using GPT-3.5-turbo to produce 1000 posts per platform and evaluate lexical features, sentiment, topics, and embedding similarity against two real datasets (US 2022 elections and Dutch influencers). Key findings show strong fidelity in lexical signals and semantic alignment of topics, but underrepresentation of user tags/URLs and a tendency toward more positive sentiment, with embedding-based similarity varying by platform. The results suggest that synthetic multi-platform data can aid reproducibility and accessibility, while highlighting areas for improvement in prompting, post-processing, and privacy-conscious data sharing.

Abstract

Social media datasets are essential for research on disinformation, influence operations, social sensing, hate speech detection, cyberbullying, and other significant topics. However, access to these datasets is often restricted due to costs and platform regulations. As such, acquiring datasets that span multiple platforms, which are crucial for a comprehensive understanding of the digital ecosystem, is particularly challenging. This paper explores the potential of large language models to create lexically and semantically relevant social media datasets across multiple platforms, aiming to match the quality of real datasets. We employ ChatGPT to generate synthetic data from two real datasets, each consisting of posts from three different social media platforms. We assess the lexical and semantic properties of the synthetic data and compare them with those of the real data. Our empirical findings suggest that using large language models to generate synthetic multi-platform social media data is promising. However, further enhancements are necessary to improve the fidelity of the outputs.
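To make the lexical comparison between real and synthetic posts more concrete, here is a minimal Python sketch. It is not the authors' evaluation code: the regular expressions, the function names `lexical_profile` and `profile_gap`, and the toy posts are all assumptions chosen for illustration. It counts hashtags, user tags, and URLs per post and reports how far the synthetic averages fall from the real ones.

```python
import re
from collections import Counter

# Simple patterns, assumed for illustration; real platforms have stricter rules.
HASHTAG = re.compile(r"#\w+")
USER_TAG = re.compile(r"@\w+")
URL = re.compile(r"https?://\S+")

FEATURES = ("hashtags", "user_tags", "urls", "words")


def lexical_profile(posts):
    """Return the per-post average of each lexical feature."""
    totals = Counter()
    for post in posts:
        totals["hashtags"] += len(HASHTAG.findall(post))
        totals["user_tags"] += len(USER_TAG.findall(post))
        totals["urls"] += len(URL.findall(post))
        totals["words"] += len(post.split())
    n = max(len(posts), 1)  # avoid division by zero on an empty list
    return {name: totals[name] / n for name in FEATURES}


def profile_gap(real_posts, synthetic_posts):
    """Synthetic average minus real average; negative means underrepresented."""
    real = lexical_profile(real_posts)
    synthetic = lexical_profile(synthetic_posts)
    return {name: synthetic[name] - real[name] for name in FEATURES}


# Hypothetical toy posts, not taken from either dataset in the paper.
real_posts = [
    "Vote today! #Election2022 @voter_info https://example.org/polls",
    "Polls close at 8pm #Election2022",
]
synthetic_posts = [
    "Make your voice heard this election! #Election2022",
    "So excited to vote today!",
]

gap = profile_gap(real_posts, synthetic_posts)
print(gap)
```

In this toy case the user-tag and URL gaps come out negative, which is the kind of underrepresentation the TL;DR describes. A real comparison would run over the full per-platform samples and would likely compare distributions rather than only means.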

Paper Structure (11 sections, 7 figures, 4 tables)

Figures (7)

  • Figure 1: Pipeline of our methodology for generating and evaluating synthetic social media datasets.
  • Figure 2: Topic overlap among platforms in the real and synthetic datasets. Platform-agnostic/aware prompts, $P=1$, $T=1$.
  • Figure 3: Topic overlap between real and synthetic (platform-agnostic) data on the US elections (a). Word clouds of unique topics in the real dataset (b), common topics (c), and new topics in the synthetic data (d) on US elections.
  • Figure 4: Topic overlap between real and synthetic (platform-aware) data on the Dutch influencers (a). Word clouds of unique topics in the real dataset (b), common topics (c), and new topics in the synthetic data (d) on Dutch influencers.
  • Figure 5: t-SNE plots of embedding vectors (1k data points are clustered into 50 clusters, and cluster centroids are plotted) of real and synthetic (platform-agnostic and platform-aware) data.
  • ...and 2 more figures