Is this the real life? Is this just fantasy? The Misleading Success of Simulating Social Interactions With LLMs

Xuhui Zhou; Zhe Su; Tiwalayo Eisape; Hyunwoo Kim; Maarten Sap

TL;DR

This paper interrogates the realism of LLM-based social simulations by contrasting Script (omniscient) and Agents (information-asymmetric) modes within a Sotopia-inspired framework. It shows that Script mode substantially inflates goal success and naturalness compared to Agents, revealing a core challenge: information asymmetry in realistic human interactions. The authors further test learning from Script-generated data via finetuning and find selective improvements accompanied by biases that degrade generalization to real-world settings. They propose reporting standards via a Simulation Card and outline avenues to improve realism, such as modeling theory of mind and external context rather than relying on omniscient access. Overall, the work clarifies the limits of current simulation paradigms and provides practical guidelines for more credible training and evaluation of AI agents in social tasks.

Abstract

Recent advances in large language models (LLMs) have enabled richer social simulations, allowing for the study of various social phenomena. However, most recent work has used a more omniscient perspective on these simulations (e.g., a single LLM generating all interlocutors), which is fundamentally at odds with the non-omniscient, information-asymmetric interactions that involve humans and AI agents in the real world. To examine these differences, we develop an evaluation framework to simulate social interactions with LLMs in various settings (omniscient, non-omniscient). Our experiments show that LLMs perform better in unrealistic, omniscient simulation settings but struggle in ones that more accurately reflect real-world conditions with information asymmetry. Our findings indicate that addressing information asymmetry remains a fundamental challenge for LLM-based agents.
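The contrast between the omniscient and non-omniscient settings can be made concrete with a small sketch of how the model's context might be assembled in each case. This is an illustration only, not the authors' actual prompts: the `Character` fields, the function names, and the example scenario are all hypothetical, chosen to mirror the goals and secrets the paper describes as sources of information asymmetry.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Character:
    """Hypothetical character record; field names are assumptions."""
    name: str
    goal: str    # private social goal
    secret: str  # private information


def script_context(scenario: str, characters: Sequence[Character]) -> str:
    """Script mode sketch: one prompt for a single LLM that sees everything."""
    lines = [f"Scenario: {scenario}"]
    for c in characters:
        lines.append(f"{c.name} | goal: {c.goal} | secret: {c.secret}")
    lines.append("Write the entire conversation between all characters.")
    return "\n".join(lines)


def agent_context(scenario: str, me: Character,
                  others: Sequence[Character]) -> str:
    """Agents mode sketch: one prompt per agent, own goal and secret only."""
    lines = [
        f"Scenario: {scenario}",
        f"You are {me.name} | goal: {me.goal} | secret: {me.secret}",
    ]
    for o in others:
        # The other party's goal and secret are deliberately withheld.
        lines.append(f"{o.name} | goal: unknown | secret: unknown")
    lines.append(f"Write only {me.name}'s next turn.")
    return "\n".join(lines)


# Hypothetical example scenario and characters.
scenario = "Two neighbors negotiate the price of a used chair."
alice = Character("Alice", "sell the chair for at least 50 dollars",
                  "needs cash by Friday")
bob = Character("Bob", "buy the chair for at most 40 dollars",
                "already has a backup offer")

script_prompt = script_context(scenario, [alice, bob])
alice_prompt = agent_context(scenario, alice, [bob])
```

In this sketch `script_prompt` exposes both characters' goals and secrets to one model, while `alice_prompt` exposes only Alice's, so an agent in the second setting has to seek the missing information through the conversation itself.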

Paper Structure (42 sections, 12 figures, 2 tables)

Introduction
Background & Related Work
Simulating Society for Analysis
Simulating Interactions for Training
Information Asymmetry in Communication
Script vs Agents Simulation
The Unified Framework for Simulation
Social Scenarios
Characters
Simulation Modes
Simulation Evaluation
Experimental setup
RQ1: Script mode overestimates LLMs' ability to achieve social goals
RQ2: Script mode overstates LLMs' capability of natural interactions
Learning from Generated Stories
...and 27 more sections

Figures (12)

Figure 1: An illustration of the contrast between Script mode simulation and Agents mode simulation. In the Agents mode, two agents, each equipped with an LLM, negotiate and strategically seek information to reach a mutual agreement. Conversely, in Script mode, a single omniscient LLM orchestrates the entire interaction based on full access to the agents' goals. These two modes end up on opposite sides of the spectrum in terms of information asymmetry from various perspectives (e.g., roles, social goals, secrets).
Figure 2: Average goal completion score of models across different modes in various settings. Overall contains all the scenarios, and the other two contain representative scenarios from the cooperative and competitive scenarios. We perform a pairwise t-test, and * denotes that the score is statistically significantly different from the other two modes in this setting ($p<0.001$).
Figure 3: Illustrative examples of the generated interactions from different simulation settings. All the examples are generated by GPT-3.5. Note that our actual prompts are more complex than the content in the green box (see Appendix \ref{appendix:full_prompt}). We observe: (1) Script simulations contain more non-verbal communication; (2) agent-based simulations tend to generate more repetitive utterances.
Figure 4: The naturalness win rate between the Script and the Agents simulations as determined by human raters. The average length of each turn in the interactions from the two modes is also shown (verbosity). We perform a pairwise t-test, and * denotes statistical significance at $p<0.001$.
Figure 5: GPT-3.5's performance on the Agents mode before (Agents) and after finetuning (Agents-ft) as well as the Script mode (Script). Overall contains all the scenarios, and the other two contain representative scenarios from the cooperative and competitive scenarios. We perform a pairwise t-test, and * denotes the score is significantly different from the other two settings ($p<0.001$).
...and 7 more figures
