$\texttt{BluePrint}$: A Social Media User Dataset for LLM Persona Evaluation and Training

Aurélien Bück-Kaeffer, Je Qin Chooi, Dan Zhao, Maximilian Puelma Touzel, Kellin Pelrine, Jean-François Godbout, Reihaneh Rabbany, Zachary Yang

TL;DR

This work introduces SIMPACT, a privacy-preserving framework for constructing behaviorally grounded social media datasets, and BluePrint, a Bluesky-derived dataset for political discourse used to train and evaluate LLM-based social media agents through next-action prediction. The pipeline clusters users into behavioral personas, removes PII, obfuscates timestamps, and uses a rich action set within thread-based structures to preserve interaction dynamics while protecting privacy. Evaluation across multiple models shows that fine-tuning improves linguistic and distributional fidelity, yet action-level realism remains challenging, with human studies indicating models become increasingly hard to distinguish from real users but still fall short on capturing nuanced community behaviors. The work provides a standardized, ethically aware benchmark for political discourse modeling and a template for domain-specific datasets aimed at mitigating misinformation and polarization in large-scale social simulations.

Abstract

Large language models (LLMs) offer promising capabilities for simulating social media dynamics at scale, enabling studies that would be ethically or logistically challenging with human subjects. However, the field lacks standardized data resources for fine-tuning and evaluating LLMs as realistic social media agents. We address this gap by introducing SIMPACT, the SIMulation-oriented Persona and Action Capture Toolkit, a privacy-respecting framework for constructing behaviorally grounded social media datasets suitable for training agent models. We formulate next-action prediction as a task for training and evaluating LLM-based agents and introduce metrics at both the cluster and population levels to assess behavioral fidelity and stylistic realism. As a concrete implementation, we release BluePrint, a large-scale dataset built from public Bluesky data focused on political discourse. BluePrint clusters anonymized users into personas of aggregated behaviours, capturing authentic engagement patterns while safeguarding privacy through pseudonymization and removal of personally identifiable information. The dataset includes an action set of 12 social media interaction types (likes, replies, reposts, etc.), with each instance tied to the posting activity preceding it. This supports the development of agents that model social media users with context dependence not only in their language but also in their interaction behaviours. By standardizing data and evaluation protocols, SIMPACT provides a foundation for advancing rigorous, ethically responsible social media simulations. BluePrint serves as both an evaluation benchmark for political discourse modeling and a template for building domain-specific datasets to study challenges such as misinformation and polarization.
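To make the idea of distribution-level behavioral fidelity concrete, the following minimal sketch compares the action-type distribution of real users with that of a model's predicted next actions. The abstract does not say which metrics the paper uses, so the choice of Jensen-Shannon divergence here is an assumption made purely for illustration, as are the function names and the three-action subset standing in for the dataset's 12 action types.

```python
from collections import Counter
from math import log2


def action_distribution(actions, action_types):
    """Relative frequency of each action type, with zeros for unseen types."""
    counts = Counter(actions)
    total = sum(counts.values())
    return [counts[a] / total if total else 0.0 for a in action_types]


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits (0 = identical, 1 = disjoint)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(a, b):
        # Terms with a_i == 0 contribute nothing; b_i > 0 whenever a_i > 0.
        return sum(ai * log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Illustrative subset only; these labels are hypothetical, not the dataset's.
ACTION_TYPES = ["like", "reply", "repost"]

real_actions = ["like", "like", "reply", "repost"]
predicted_actions = ["like", "reply", "reply", "repost"]

p = action_distribution(real_actions, ACTION_TYPES)
q = action_distribution(predicted_actions, ACTION_TYPES)
print(f"JS divergence: {js_divergence(p, q):.4f}")
```

A value near 0 would indicate that the predicted actions follow roughly the same mix of action types as the real ones. The same comparison could be run per cluster or over the whole population, depending on which level of fidelity is being examined.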

Paper Structure

This paper contains 31 sections, 4 figures, and 11 tables.

Figures (4)

  • Figure 1: SIMPACT (Simulation-oriented Persona and Action Capture Toolkit) for constructing privacy-preserving social media user datasets. We collect raw social media data, generate user embeddings, and cluster users into behavioral archetypes (e.g., fans, political groups) using constrained K-means. Privacy is preserved through PII removal, timestamp obfuscation, and per-thread pseudonymization of user IDs. The resulting dataset contains anonymized text, actions, and behavioral clusters suitable for social media agent simulation.
  • Figure 2: Human accuracy in distinguishing real from model-generated posts (lower is better; 95% CI). An accuracy of 50% corresponds to random guessing, meaning the model's posts are indistinguishable from those of real users. Models fine-tuned on BluePrint are significantly harder to identify, with our best model achieving near-chance accuracy (56.0%).
  • Figure 3: Layout of the questions as presented to participants
  • Figure 4: Visualization of user embeddings for 80,000 randomly selected users. Each user embedding is the average of the embedding vectors of all of that user's posts.
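Two of the steps named in the captions above, per-thread pseudonymization of user IDs (Figure 1) and averaging post embeddings into a user embedding (Figure 4), can be sketched in a few lines. This is a minimal illustration, not the SIMPACT implementation: the record fields `user_id` and `type`, the `user_N` pseudonym scheme, and both function names are assumptions introduced here.

```python
def pseudonymize_thread(actions):
    """Replace user IDs with pseudonyms that are only valid inside one thread.

    Pseudonyms are assigned in order of first appearance, so the same user
    can be linked within a thread but not across threads.
    """
    mapping = {}
    result = []
    for action in actions:
        uid = action["user_id"]  # hypothetical field name
        if uid not in mapping:
            mapping[uid] = f"user_{len(mapping)}"
        result.append({**action, "user_id": mapping[uid]})
    return result


def mean_embedding(post_vectors):
    """Average a user's post embedding vectors component by component."""
    if not post_vectors:
        raise ValueError("user has no posts to average")
    n = len(post_vectors)
    dim = len(post_vectors[0])
    return [sum(vec[i] for vec in post_vectors) / n for i in range(dim)]


# Hypothetical thread: three actions by two users.
thread = [
    {"user_id": "alice", "type": "post"},
    {"user_id": "bob", "type": "reply"},
    {"user_id": "alice", "type": "like"},
]
anonymized = pseudonymize_thread(thread)
print([a["user_id"] for a in anonymized])

# Toy 2-dimensional post embeddings for a single user.
print(mean_embedding([[1.0, 2.0], [3.0, 4.0]]))
```

Because the mapping is rebuilt for every thread, a user who appears in two threads may receive different pseudonyms in each. That limits cross-thread linkage while keeping the within-thread interaction structure intact.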