Towards Operational Validation of LLM-Agent Social Simulations: A Replicated Study of a Reddit-like Technology Forum

Aleksandar Tomašević, Darja Cvetković, Sara Major, Slobodan Maletić, Miroslav Anđelković, Ana Vranić, Boris Stupovski, Dušan Vudragović, Aleksandar Bogojević, Marija Mitrović Dankulov

TL;DR

The paper tackles the challenge of validating LLM‑driven social simulations by benchmarking a Voat‑like technology forum against real‑world MADOC calibration data. It treats LLMs as norm‑guided cultural technologies operating in stateless, micro‑dialogue interactions within a thread‑and‑feed platform (YSocial) and validates the emergent patterns across 30 independent 30‑day replications. The study reports on activity trajectories, network core–periphery structure, toxicity distributions, topic overlap, and embedding similarity, providing a reproducible recipe and explicit discussion of limitations and next steps (e.g., memory integration, feed variations). The findings suggest that, even in a memoryless setting, LLM agents can reproduce many macro‑scale platform patterns while highlighting systematic divergences in activity volume, core coupling, and toxicity, which inform future refinements for more faithful moderation‑aware simulations with broader applicability.

Abstract

Large Language Models (LLMs) enable generative social simulations that can capture culturally informed, norm-guided interaction on online social platforms. We build a technology community simulation modeled on Voat, a Reddit-like alt-right news aggregator and discussion platform active from 2014 to 2020. Using the YSocial framework, we seed the simulation with a fixed catalog of technology links sampled from Voat's shared URLs (covering 30+ domains) and calibrate parameters to Voat's v/technology using samples from the MADOC dataset. Agents use a base, uncensored model (Dolphin 3.0, based on Llama 3.1 8B) and concise personas (demographics, political leaning, interests, education, toxicity propensity) to generate posts, replies, and reactions under platform rules for link and text submissions, threaded replies and daily activity cycles. We run a 30-day simulation and evaluate operational validity by comparing distributions and structures with matched Voat data: activity patterns, interaction networks, toxicity, and topic coverage. Results indicate familiar online regularities: similar activity rhythms, heavy-tailed participation, sparse low-clustering interaction networks, core-periphery structure, topical alignment with Voat, and elevated toxicity. Limitations of the current study include the stateless agent design and evaluation based on a single 30-day run, which constrains external validity and variance estimates. The simulation generates realistic discussions, often featuring toxic language, primarily centered on technology topics such as Big Tech and AI. This approach offers a valuable method for examining toxicity dynamics and testing moderation strategies within a controlled environment.

Paper Structure

This paper contains 41 sections, 2 equations, 13 figures, 21 tables.

Figures (13)

  • Figure 1: Cumulative activity growth over 30 days across 30 simulation runs (shaded bands show 5th–95th percentile range). All metrics show consistent growth trajectories with tight confidence intervals, demonstrating robustness across replications.
  • Figure 2: KDE of log posts per user (computed as $\log(1+\mathrm{posts})$ to handle zeros): simulation (30 runs) vs. Voat (30 samples). Solid curves show the mean density; shaded bands show the 5th–95th percentile range across runs/samples. Both corpora exhibit heavy participation skew with a long right tail on a log scale.
  • Figure 3: Degree distribution (log-log) for 60 networks: 30 simulation runs and 30 Voat samples. Both show heavy-tailed distributions consistent with participation inequality.
  • Figure 4: Core–periphery structure on the largest connected component: simulation vs. matched real Voat sample.
  • Figure 5: Toxicity score distributions (KDE) pooled across 30 simulation runs vs. 30 Voat samples.
  • ...and 8 more figures
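
The Figure 2 caption describes a density of $\log(1+\mathrm{posts})$ per user, with a mean curve and a 5th–95th percentile band across runs. The following is a minimal sketch of that kind of computation, not the authors' code. It assumes a Gaussian kernel with a fixed, arbitrarily chosen bandwidth and linear-interpolation percentiles; the function names (`log1p_counts`, `gaussian_kde`, `percentile`, `density_band`) and the synthetic Pareto-distributed counts are hypothetical stand-ins for illustration only.

```python
import math
import random


def log1p_counts(posts_per_user):
    """Apply log(1 + x) so that users with zero posts stay well defined."""
    return [math.log1p(c) for c in posts_per_user]


def gaussian_kde(samples, grid, bandwidth):
    """Evaluate a Gaussian kernel density estimate at each grid point."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((g - s) / bandwidth) ** 2) for s in samples)
        for g in grid
    ]


def percentile(values, q):
    """Linear-interpolation percentile for q in [0, 100]."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    frac = pos - lo
    return ordered[lo] * (1.0 - frac) + ordered[hi] * frac


def density_band(runs, grid, bandwidth=0.3):
    """Return (mean, p5, p95) density curves across runs, per grid point.

    The bandwidth default is an assumption, not a value from the paper.
    """
    curves = [gaussian_kde(log1p_counts(r), grid, bandwidth) for r in runs]
    columns = list(zip(*curves))  # one tuple of densities per grid point
    mean = [sum(col) / len(col) for col in columns]
    p5 = [percentile(col, 5) for col in columns]
    p95 = [percentile(col, 95) for col in columns]
    return mean, p5, p95


# Synthetic stand-in data: 30 runs of 200 users with heavy-tailed post counts.
rng = random.Random(0)
runs = [[int(rng.paretovariate(1.5)) - 1 for _ in range(200)] for _ in range(30)]
grid = [i * 0.1 for i in range(61)]  # log(1 + posts) from 0.0 to 6.0
mean, p5, p95 = density_band(runs, grid)
```

Plotting `mean` as a solid curve and filling between `p5` and `p95` would give a figure of the same general form as the caption describes. The band here is a spread across runs at each grid point, not a confidence interval for the mean.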