When simulations look right but causal effects go wrong: Large language models as behavioral simulators

Zonghan Li, Feng Ji

Abstract

Behavioral simulation is increasingly used to anticipate responses to interventions. Large language models (LLMs) enable researchers to specify population characteristics and intervention context in natural language, but it remains unclear to what extent LLMs can use these inputs to infer intervention effects. We evaluated three LLMs on 11 climate-psychology interventions using a dataset of 59,508 participants from 62 countries, and replicated the main analysis in two additional datasets (12 and 27 countries). LLMs reproduced observed patterns in attitudinal outcomes (e.g., climate beliefs and policy support) reasonably well, and prompting refinements improved this descriptive fit. However, descriptive fit did not reliably translate into causal fidelity (i.e., accurate estimates of intervention effects), and these two dimensions of accuracy followed different error structures. This descriptive-causal divergence held across all three datasets but varied across intervention logics, with larger errors for interventions that depended on evoking internal experience rather than on directly conveying reasons or social cues. It was more pronounced for behavioral outcomes, where LLMs imposed stronger attitude-behavior coupling than in human data. Countries and population groups that appeared well captured descriptively were not necessarily those with lower causal errors. Relying on descriptive fit alone may therefore create unwarranted confidence in simulation results, yielding misleading conclusions about intervention effects and masking population disparities that matter for fairness.
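
To make the descriptive-causal distinction concrete, here is a minimal sketch (not the authors' code; the two-arm design, the 1-7 rating scale, and all numbers are illustrative assumptions) of how a simulation can match outcome levels closely while badly misestimating the average treatment effect (ATE):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical human data: control vs. treated policy-support ratings (1-7).
human_ctrl = rng.normal(4.0, 1.0, 500)
human_trt = rng.normal(4.3, 1.0, 500)   # true ATE ≈ +0.3

# Hypothetical LLM simulation: pooled levels land near the human mean,
# but the simulated intervention shifts responses far more than humans did.
sim_ctrl = rng.normal(3.8, 1.0, 500)
sim_trt = rng.normal(4.8, 1.0, 500)     # simulated ATE ≈ +1.0

# Descriptive fit: error in the mean outcome level, pooled across arms.
level_error = abs(np.r_[sim_ctrl, sim_trt].mean()
                  - np.r_[human_ctrl, human_trt].mean())

# Causal fidelity: error in the average treatment effect.
ate_error = abs((sim_trt.mean() - sim_ctrl.mean())
                - (human_trt.mean() - human_ctrl.mean()))

print(f"level error ≈ {level_error:.2f}, ATE error ≈ {ate_error:.2f}")
# Pooled levels look close (roughly 0.15) while the causal estimate is
# off by roughly 0.7 points: the descriptive-causal divergence in miniature.
```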

Paper Structure

This paper contains 16 sections, 10 figures, and 4 tables.

Figures (10)

  • Figure 1: Roadmap of this study.
  • Figure 2: A better descriptive fit does not necessarily imply better causal fidelity. a. Mean error (ME) by model and outcome under baseline conditions. b. Mean absolute error (MAE) by model and outcome. c. Country-level Spearman rank correlation and KNN overlap. d. Ratio of predicted to observed variance across models and outcomes. e. Enhancement selection pipeline, in which five candidate techniques were compared on a development sample (n = 5,093). KNN-based few-shot prompting and VBN-based chain-of-thought prompting were selected. f. Enhancement effect on country-level mean error. g. Enhancement effect on individual-level MAE. h. Enhancement effect on country-level Spearman correlation and KNN overlap. i. Enhancement effect on Jensen-Shannon divergence by outcome. j. Mean absolute ATE error by outcome and prompting method for each LLM.
  • Figure 3: Descriptive fit and causal fidelity reveal distinct error structures across models and datasets. a. Outcome type dominates descriptive error, but causal error redistributes across method, country, and their interactions. b. The structural shift replicates in both additional datasets, although the dimensions involved vary with experimental design. c. Intraclass correlation coefficients (ICC) show that country-level heterogeneity is amplified in causal evaluation across all model-dataset combinations (a minimal ICC sketch follows this figure list).
  • Figure 4: Causal error is structured by intervention logic, and LLM simulations impose stronger attitude-behavior coupling than observed in human data. Results shown for GPT; other models in Supplementary Information. a. Situational simulation interventions show the largest causal overestimation, and cultural and group-norm interventions the smallest. Each point represents the signed ATE error of an intervention under baseline (red), few-shot (blue), or VBN-CoT (purple) prompting. Crosses indicate direction-flipped effects. Open circles indicate interventions whose human ATE 95% CI crosses zero. b. LLM simulations tighten the attitude-behavior coupling relative to human data. Lower triangles show individual-level response correlations. Upper triangles show intervention-level ATE correlations. c. Models are sensitive to persuasive force in intervention text for attitudinal outcomes, but action estimates do not respond coherently to the same perturbations. Bars show the change in absolute ATE error relative to the original intervention text under text perturbations. Error bars indicate 95% CI. d. Instructing models to decouple attitudes from behavior does not consistently reduce action ATE error. Corrections improve some interventions but overcorrect others.
  • Figure 5: Countries and population groups well captured descriptively are not necessarily those with accurate causal estimates. Results shown for GPT; other models in Supplementary Information. a. Descriptive and causal error are weakly correlated at the country level. Open markers indicate baseline, and filled markers indicate prompting refinements. b. Countries with highest and lowest descriptive error. c. Countries with highest and lowest causal error. Overlap between b and c is minimal. d-e. Country-level descriptive gains concentrate in OECD and high-internet countries, but these advantages do not carry over to causal fidelity. Error bars show 95% CI. f-g. Descriptive disparities are largest along age and gender, while causal disparities are largest along political orientation and SES. Error bars show 95% CI.
  • ...and 5 more figures
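
As a companion to Figure 3c, the sketch below computes a one-way random-effects ICC(1), the standard estimator of how much error variance sits between countries. It is written from scratch on hypothetical per-participant errors and may differ in detail from the authors' estimator:

```python
import numpy as np

def icc1(errors_by_country: list[np.ndarray]) -> float:
    """One-way ANOVA ICC(1): share of error variance between countries."""
    k = len(errors_by_country)
    n = np.array([len(g) for g in errors_by_country])
    grand = np.concatenate(errors_by_country).mean()
    group_means = np.array([g.mean() for g in errors_by_country])

    ss_between = np.sum(n * (group_means - grand) ** 2)
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in errors_by_country)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n.sum() - k)
    n_bar = n.mean()  # balanced-design approximation for unequal group sizes

    return (ms_between - ms_within) / (ms_between + (n_bar - 1) * ms_within)

# Toy check: strong country clustering in causal error yields a high ICC.
rng = np.random.default_rng(1)
countries = [rng.normal(mu, 0.2, 100) for mu in (0.1, 0.5, 0.9)]
print(f"ICC(1) ≈ {icc1(countries):.2f}")  # close to 0.8 for these settings
```

A high ICC on causal errors, as in Figure 3c, means that where a simulation fails is largely determined by country, which is exactly the kind of structured disparity that pooled accuracy metrics hide.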