Towards Operational Validation of LLM-Agent Social Simulations: A Replicated Study of a Reddit-like Technology Forum
Aleksandar Tomašević, Darja Cvetković, Sara Major, Slobodan Maletić, Miroslav Anđelković, Ana Vranić, Boris Stupovski, Dušan Vudragović, Aleksandar Bogojević, Marija Mitrović Dankulov
TL;DR
The paper addresses the validation of LLM-driven social simulations by benchmarking a simulated Voat-like technology forum against real-world calibration data from the MADOC dataset. It treats LLMs as norm-guided cultural technologies: agents are stateless and interact through micro-dialogues on a thread-and-feed platform (YSocial). Emergent patterns are validated across 30 independent 30-day replications, covering activity trajectories, network core-periphery structure, toxicity distributions, topic overlap, and embedding similarity. The study provides a reproducible recipe and explicitly discusses limitations and next steps (e.g., memory integration, feed variations). The findings suggest that, even in a memoryless setting, LLM agents can reproduce many macro-scale platform patterns, while systematic divergences in activity volume, core coupling, and toxicity indicate where refinement is needed for more faithful, moderation-aware simulations.
Abstract
Large Language Models (LLMs) enable generative social simulations that can capture culturally informed, norm-guided interaction on online social platforms. We build a technology community simulation modeled on Voat, a Reddit-like alt-right news aggregator and discussion platform active from 2014 to 2020. Using the YSocial framework, we seed the simulation with a fixed catalog of technology links sampled from Voat's shared URLs (covering 30+ domains) and calibrate parameters to Voat's v/technology using samples from the MADOC dataset. Agents use a base, uncensored model (Dolphin 3.0, based on Llama 3.1 8B) and concise personas (demographics, political leaning, interests, education, toxicity propensity) to generate posts, replies, and reactions under platform rules for link and text submissions, threaded replies, and daily activity cycles. We run a 30-day simulation and evaluate operational validity by comparing distributions and structures with matched Voat data: activity patterns, interaction networks, toxicity, and topic coverage. Results indicate familiar online regularities: similar activity rhythms, heavy-tailed participation, sparse low-clustering interaction networks, core-periphery structure, topical alignment with Voat, and elevated toxicity. Limitations of the current study include the stateless agent design and evaluation based on a single 30-day run, both of which constrain external validity and variance estimates. The simulation generates realistic discussions, often featuring toxic language, primarily centered on technology topics such as Big Tech and AI. This approach offers a valuable method for examining toxicity dynamics and testing moderation strategies within a controlled environment.
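
As an illustration of what "comparing distributions" between simulated and reference data can look like, the following minimal sketch computes per-user comment counts and a two-sample Kolmogorov-Smirnov statistic between two samples. The choice of the KS statistic, the function names, and the `(user_id, thread_id)` event format are assumptions made for this example; the abstract does not state which comparison statistic or data layout the study uses.

```python
from bisect import bisect_right
from collections import Counter


def comments_per_user(events):
    """Return sorted per-user comment counts.

    `events` is an iterable of (user_id, thread_id) pairs; this layout is a
    hypothetical stand-in, not the MADOC or YSocial schema.
    """
    return sorted(Counter(user for user, _ in events).values())


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: the largest gap between the empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    gap = 0.0
    # The empirical CDFs only change at observed values, so checking those suffices.
    for x in set(a) | set(b):
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


if __name__ == "__main__":
    # Toy event logs, invented purely to exercise the functions.
    simulated = [("u1", 1), ("u1", 2), ("u1", 3), ("u2", 1), ("u3", 2)]
    reference = [("v1", 1), ("v1", 2), ("v2", 1), ("v3", 1), ("v4", 3)]
    distance = ks_statistic(comments_per_user(simulated), comments_per_user(reference))
    print(f"KS distance between participation distributions: {distance:.3f}")
```

A value near 0 means the two participation distributions are nearly indistinguishable at this level of description, and a value near 1 means they barely overlap. The same comparison could be applied to other per-unit quantities named above, such as per-comment toxicity scores.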
