This human study did not involve human subjects: Validating LLM simulations as behavioral evidence

Jessica Hullman, David Broska, Huaman Sun, Aaron Shaw

TL;DR

This paper articulates a principled framework for using LLM simulations in behavioral research, distinguishing heuristic validation from statistical calibration. It formalizes two essential conditions for valid substitution of LLM outputs for human data: No Training Leakage and preservation of moment conditions that identify target parameters, drawing on Ludwig et al. to define when simple substitution is defensible. It then surveys calibration-based methods, notably prediction-powered inference (PPI) and plug-in bias corrections, which can yield unbiased and more precise inferences under explicit assumptions, while acknowledging that gains depend on LLM accuracy and data availability. Beyond substitution, the work highlights the role of LLMs in exploratory design, causal discovery, stress-testing, and mechanism hypothesizing, arguing for careful, transparent use that leverages LLMs to enhance theory and design without compromising confirmatory standards. Overall, the paper offers a practical, assumption-driven roadmap for integrating LLM simulations into behavioral science with calibrated inferences and clearly delineated exploratory opportunities.

Abstract

A growing literature uses large language models (LLMs) as synthetic participants to generate cost-effective and nearly instantaneous responses in social science experiments. However, there is limited guidance on when such simulations support valid inference about human behavior. We contrast two strategies for obtaining valid estimates of causal effects and clarify the assumptions under which each is suitable for exploratory versus confirmatory research. Heuristic approaches seek to establish that simulated and observed human behavior are interchangeable through prompt engineering, model fine-tuning, and other repair strategies designed to reduce LLM-induced inaccuracies. While useful for many exploratory tasks, heuristic approaches lack the formal statistical guarantees typically required for confirmatory research. In contrast, statistical calibration combines auxiliary human data with statistical adjustments to account for discrepancies between observed and simulated responses. Under explicit assumptions, statistical calibration preserves validity and provides more precise estimates of causal effects at lower cost than experiments that rely solely on human participants. Yet the potential of both approaches depends on how well LLMs approximate the relevant populations. We consider what opportunities are overlooked when researchers focus myopically on substituting LLMs for human participants in a study.

Paper Structure (37 sections, 29 equations, 3 figures)

Figures (3)

  • Figure 1: A validate-then-simulate approach involves collecting some human observations, which are jointly labeled by the LLM ($D_{\text{shared}}$), and using this dataset to demonstrate that the model achieves sufficient fidelity in approximating important aspects of the human results. Heuristic approaches then imply or assert that estimates ($\hat{\theta}_{LLM}$) derived from a larger dataset where only LLM labels are available ($D_{\text{LLM}}$) may serve as proxies for human-derived estimates ($\hat{\theta}_H$) without statistical adjustment.
  • Figure 2: Statistical calibration approaches derive estimators that explicitly account for bias contributed by LLM approximations to human responses. PPI (Angelopoulos et al., 2023), DSL (Egami et al., 2024), and related approaches learn a rectifier and additively combine it with a base estimator. Approaches to plug-in correction learn a model of the relationship between human ground truth and LLM predictions using jointly labeled data ($D_{\text{shared}}$), then either correct the LLM predictions prior to estimation (Ludwig et al., 2025) or directly adjust the target inference (Wang et al., 2020).
  • Figure 3: In a simulate-then-validate approach, the researcher uses LLMs exclusively for exploratory piloting to find hypotheses with support, then conducts a human study on those that appear most promising.
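
The rectifier idea in the Figure 2 caption can be illustrated with a minimal sketch for the simplest target, a population mean. This is not the authors' implementation: the function name `ppi_mean`, the synthetic data, and the assumed LLM bias of 0.5 are all hypothetical choices made for illustration, and the interval uses a plain normal approximation. The estimator takes the LLM-only mean on $D_{\text{LLM}}$ as the base estimator and adds the average human-minus-LLM discrepancy measured on $D_{\text{shared}}$.

```python
import math
import random
from statistics import fmean, variance


def ppi_mean(y_shared, f_shared, f_llm_only, z=1.96):
    """Rectified (PPI-style) estimate of a human-population mean.

    y_shared   : human responses on the jointly labeled items (D_shared)
    f_shared   : LLM responses on those same items
    f_llm_only : LLM responses on items with no human label (D_LLM)
    z          : normal quantile for the interval (1.96 for roughly 95%)
    """
    n, big_n = len(y_shared), len(f_llm_only)
    # Rectifier: average human-minus-LLM discrepancy on the shared data.
    residuals = [y - f for y, f in zip(y_shared, f_shared)]
    rectifier = fmean(residuals)
    # Base estimator (LLM-only mean) plus the additive correction.
    theta = fmean(f_llm_only) + rectifier
    # Normal-approximation standard error, treating the two samples as independent.
    se = math.sqrt(variance(f_llm_only) / big_n + variance(residuals) / n)
    return theta, (theta - z * se, theta + z * se)


# Synthetic illustration (all numbers are assumptions, not results from the paper):
# humans have mean 2.0; the simulated "LLM" tracks them but is shifted up by 0.5.
rng = random.Random(0)


def draw_pair():
    latent = rng.gauss(0.0, 1.0)
    human = 2.0 + latent + rng.gauss(0.0, 0.3)
    llm = 2.5 + latent + rng.gauss(0.0, 0.3)
    return human, llm


shared = [draw_pair() for _ in range(200)]                # D_shared
llm_only = [draw_pair()[1] for _ in range(5000)]          # D_LLM (human label unused)

y_shared = [h for h, _ in shared]
f_shared = [f for _, f in shared]

naive = fmean(llm_only)                                   # simple substitution
theta, ci = ppi_mean(y_shared, f_shared, llm_only)        # rectified estimate
human_only_se = math.sqrt(variance(y_shared) / len(y_shared))
ppi_se = (ci[1] - ci[0]) / (2 * 1.96)

print(f"naive LLM mean:      {naive:.3f}")
print(f"rectified estimate:  {theta:.3f}  CI=({ci[0]:.3f}, {ci[1]:.3f})")
print(f"SE human-only vs rectified: {human_only_se:.3f} vs {ppi_se:.3f}")
```

In this toy setup the naive LLM mean inherits the simulated bias, while the rectified estimate recenters on the human mean. Its standard error is smaller than that of the human-only sample because the LLM responses are strongly correlated with the human ones; when that correlation is weak, the precision gain shrinks or disappears, which matches the caveat that gains depend on LLM accuracy and data availability.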