The Challenge of Using LLMs to Simulate Human Behavior: A Causal Inference Perspective
George Gui, Olivier Toubia
TL;DR
This work identifies a fundamental pitfall in using large language models to simulate human counterfactuals: blinded prompts can induce unintended variation in covariates, violating unconfoundedness and biasing causal inferences. Through a demand-estimation study with 40 products, the authors show how price changes correlate with other factors, producing endogeneity and implausible demand curves, and reveal a trade-off between covariate control and ecological validity (focalism). They formalize the ambiguity risk of prompting strategies and demonstrate that unblinding prompts (explicitly stating the experimental design and randomization) consistently improves predictive accuracy across models and remains beneficial when fine-tuning with human data or mixing in unrelated datasets. The findings advocate for unambiguous prompting as a core design principle in LLM simulations, with practical implications for research evaluation, prompt design, and model development. Overall, the paper offers a causal-inference framework and empirical guidance to enhance the reliability and validity of LLM-based behavioral simulations.
Abstract
Large Language Models (LLMs) have shown impressive potential to simulate human behavior. We identify a fundamental challenge in using them to simulate experiments: when LLM-simulated subjects are blind to the experimental design (as is standard practice with human subjects), variations in treatment systematically affect unspecified variables that should remain constant, violating the unconfoundedness assumption. Using demand estimation as a context and an actual experiment with 40 different products as a benchmark, we show this can lead to implausible results. While confounding may in principle be addressed by controlling for covariates, this can compromise ecological validity in the context of LLM simulations: controlled covariates become artificially salient in the simulated decision process. We show formally that confounding stems from ambiguous prompting strategies. Therefore, it can be addressed by developing unambiguous prompting strategies through unblinding, i.e., revealing the experimental design in LLM simulations. Our empirical results show that this strategy consistently enhances model performance across all tested models, including both out-of-the-box reasoning and non-reasoning models. We also show that it is a technique that complements fine-tuning: while fine-tuning can improve simulation performance, an unambiguous prompting strategy makes the predictions robust to the inclusion of irrelevant data in the fine-tuning process.
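To make the confounding mechanism concrete, the following is a minimal numerical sketch and not the authors' procedure. It assumes a hypothetical unspecified covariate ("perceived quality") that a blinded simulated subject infers from the price, and it compares the estimated demand slope with and without that dependence. Every name and number here (`price_to_quality`, the effect sizes, the sample size) is an illustrative assumption and is not taken from the paper.

```python
import random


def ols_slope(x, y):
    """Slope of a simple least-squares regression of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var = sum((a - mean_x) ** 2 for a in x)
    return cov / var


def simulate_demand_slope(n, price_to_quality, seed=0):
    """Estimate the price slope of a simulated purchase-intention score.

    price_to_quality is a hypothetical parameter: how strongly the
    simulated subject's perceived quality moves with the stated price.
    A value of 0.0 stands in for the unblinded case, where the subject
    knows the price was randomized and so infers nothing from it.
    """
    rng = random.Random(seed)
    true_price_effect = -1.0  # assumed causal effect of price
    quality_effect = 2.0      # assumed effect of perceived quality
    prices, intentions = [], []
    for _ in range(n):
        price = rng.uniform(1.0, 10.0)
        # Unspecified covariate that may drift with the treatment.
        quality = price_to_quality * price + rng.gauss(0.0, 1.0)
        intention = (10.0 + true_price_effect * price
                     + quality_effect * quality + rng.gauss(0.0, 1.0))
        prices.append(price)
        intentions.append(intention)
    return ols_slope(prices, intentions)


# Blinded: perceived quality rises with price, so the slope is confounded.
blinded_slope = simulate_demand_slope(5000, price_to_quality=0.8)
# Unblinded: perceived quality is independent of the randomized price.
unblinded_slope = simulate_demand_slope(5000, price_to_quality=0.0)

print(f"blinded slope:   {blinded_slope:.2f}")
print(f"unblinded slope: {unblinded_slope:.2f}")
```

Under these assumed parameters, the blinded regression mixes the price effect with the quality effect (roughly -1.0 + 2.0 × 0.8), which yields an upward-sloping and therefore implausible demand curve, while the unblinded regression recovers a slope close to the assumed causal effect of -1.0.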
