This human study did not involve human subjects: Validating LLM simulations as behavioral evidence
Jessica Hullman, David Broska, Huaman Sun, Aaron Shaw
TL;DR
This paper sets out a principled framework for using LLM simulations in behavioral research, distinguishing heuristic validation from statistical calibration. It formalizes two conditions required for validly substituting LLM outputs for human data: No Training Leakage and preservation of the moment conditions that identify the target parameters, drawing on Ludwig et al. to define when simple substitution is defensible. It then surveys calibration-based methods, notably prediction-powered inference (PPI) and plug-in bias corrections, which can yield unbiased and more precise inferences under explicit assumptions, with gains that depend on LLM accuracy and data availability. Beyond substitution, the paper highlights the role of LLMs in exploratory design, causal discovery, stress-testing, and mechanism hypothesizing, and argues for careful, transparent use that strengthens theory and design without compromising confirmatory standards. Overall, it offers a practical, assumption-driven roadmap for integrating LLM simulations into behavioral science, with calibrated inferences and clearly delineated exploratory opportunities.
Abstract
A growing literature uses large language models (LLMs) as synthetic participants to generate cost-effective and nearly instantaneous responses in social science experiments. However, there is limited guidance on when such simulations support valid inference about human behavior. We contrast two strategies for obtaining valid estimates of causal effects and clarify the assumptions under which each is suitable for exploratory versus confirmatory research. Heuristic approaches seek to establish that simulated and observed human behavior are interchangeable through prompt engineering, model fine-tuning, and other repair strategies designed to reduce LLM-induced inaccuracies. While useful for many exploratory tasks, heuristic approaches lack the formal statistical guarantees typically required for confirmatory research. In contrast, statistical calibration combines auxiliary human data with statistical adjustments to account for discrepancies between observed and simulated responses. Under explicit assumptions, statistical calibration preserves validity and provides more precise estimates of causal effects at lower cost than experiments that rely solely on human participants. Yet the potential of both approaches depends on how well LLMs approximate the relevant populations. We consider what opportunities are overlooked when researchers focus myopically on substituting LLMs for human participants in a study.
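To make the idea of statistical calibration concrete, the following is a minimal sketch of a prediction-powered (rectified) estimate in the simplest possible setting: estimating a population mean. It is an illustration under stated assumptions, not the paper's own implementation. The function name `ppi_mean`, its argument names, and the toy numbers are all hypothetical, and the standard error assumes the human-labeled units and the simulation-only units are independent random draws from the same population.

```python
import math
import statistics


def ppi_mean(human_y, llm_on_human, llm_only):
    """Rectified (prediction-powered) estimate of a population mean.

    human_y      : observed human responses for n units
    llm_on_human : LLM-simulated responses for those same n units (paired)
    llm_only     : LLM-simulated responses for N further units with no human data
    Returns (estimate, standard_error).
    """
    if len(human_y) != len(llm_on_human):
        raise ValueError("human_y and llm_on_human must be paired")
    n, N = len(human_y), len(llm_only)

    # Rectifier: average discrepancy between human and simulated responses,
    # measured on the units where both are available.
    residuals = [y - f for y, f in zip(human_y, llm_on_human)]
    rectifier = statistics.fmean(residuals)

    # Simulation-only mean, shifted by the estimated LLM bias.
    estimate = statistics.fmean(llm_only) + rectifier

    # The two sample means are independent, so their variances add.
    variance = (statistics.variance(llm_only) / N
                + statistics.variance(residuals) / n)
    return estimate, math.sqrt(variance)


# Toy illustration (made-up numbers): the simulated responses overshoot
# the human responses by a constant 0.5 on the paired units.
human_y = [3.0, 4.0, 5.0, 4.0]
llm_on_human = [3.5, 4.5, 5.5, 4.5]
llm_only = [4.0, 5.0, 4.5, 5.5, 4.0, 5.0]

est, se = ppi_mean(human_y, llm_on_human, llm_only)
print(f"estimate = {est:.3f}, standard error = {se:.3f}")
```

The design point is that the human sample is used only to estimate the LLM's error, while the larger simulated sample carries most of the information about the mean. When simulated and human responses track each other closely, the residual variance is small and the combined estimate is more precise than one based on the human sample alone; when they do not, the rectifier term dominates and the gain shrinks.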
