Don't Trust Generative Agents to Mimic Communication on Social Networks Unless You Benchmarked their Empirical Realism
Simon Münker, Nils Schwager, Achim Rettinger
TL;DR
This work examines whether generative agents can faithfully mimic the communication of social-network users for empirical studies. It introduces a formal Twins of Online Social Networks (TWON) framework that models both user behavior and platform mechanics, and runs a case study that imitates English- and German-language X users across posting, replying, and reply-likelihood tasks. Agents are fitted with content-generation and behavior-likelihood pipelines, and their empirical realism is benchmarked with the losses $L_b$ and $L_r$ and several discourse metrics. The results show that realism depends strongly on language and dataset: English models generally achieve higher realism, especially for replies, while German models show limited realism and higher variance. The authors recommend validating realism in the setting in which the simulation components were fitted, advocate language- and community-specific modeling, and outline ethical considerations and future directions for robust, policy-relevant social simulations.
Abstract
The ability of Large Language Models (LLMs) to mimic human behavior has triggered a plethora of computational social science research that assumes empirical studies of humans can be conducted with AI agents instead. Because research findings conflict on whether and when this hypothesis holds, the differences between the underlying experimental designs need to be better understood. We focus on using LLMs to replicate the behavior of social network users in order to analyze communication on social networks. First, we provide a formal framework for the simulation of social networks; we then focus on the sub-task of imitating user communication. We empirically test different approaches to imitating user behavior on X in English and German. Our findings suggest that social simulations should be validated by their empirical realism, measured in the setting in which the simulation components were fitted. With this paper, we argue for more rigor when applying generative-agent-based modeling to social simulation.
