The Challenge of Using LLMs to Simulate Human Behavior: A Causal Inference Perspective

George Gui, Olivier Toubia

TL;DR

This work identifies a fundamental pitfall in using large language models to simulate human counterfactuals: blinded prompts can induce unintended variation in covariates, violating unconfoundedness and biasing causal inferences. Through a demand-estimation study with 40 products, the authors show how price changes correlate with other factors (past price, competing price, and expiration days), producing endogeneity and implausible demand curves, and reveal a trade-off between covariate control and ecological validity (focalism). They formalize the ambiguity risk of prompting strategies and demonstrate that unblinding prompts, i.e., explicitly stating the experimental design and randomization, consistently improves predictive accuracy across models and remains beneficial when fine-tuning with human data or mixing in unrelated datasets. The findings advocate for unambiguous prompting as a core design principle in LLM simulations, with practical implications for research evaluation, prompt design, and model development. Overall, the paper offers a causal-inference framework and empirical guidance to enhance the reliability and validity of LLM-based behavioral simulations.

Abstract

Large Language Models (LLMs) have shown impressive potential to simulate human behavior. We identify a fundamental challenge in using them to simulate experiments: when LLM-simulated subjects are blind to the experimental design (as is standard practice with human subjects), variations in treatment systematically affect unspecified variables that should remain constant, violating the unconfoundedness assumption. Using demand estimation as a context and an actual experiment with 40 different products as a benchmark, we show this can lead to implausible results. While confounding may in principle be addressed by controlling for covariates, this can compromise ecological validity in the context of LLM simulations: controlled covariates become artificially salient in the simulated decision process. We show formally that confounding stems from ambiguous prompting strategies. Therefore, it can be addressed by developing unambiguous prompting strategies through unblinding, i.e., revealing the experiment design in LLM simulations. Our empirical results show that this strategy consistently enhances model performance across all tested models, including both out-of-the-box reasoning and non-reasoning models. We also show that it is a technique that complements fine-tuning: while fine-tuning can improve simulation performance, an unambiguous prompting strategy makes the predictions robust to the inclusion of irrelevant data in the fine-tuning process.
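
To make the blinded/unblinded distinction concrete, here is a minimal sketch of how the two prompt styles might be built. The prompt wording, the function names `blinded_prompt` and `unblinded_prompt`, and the price values are illustrative assumptions, not the prompts used in the paper.

```python
from typing import Sequence


def blinded_prompt(product: str, price: float) -> str:
    """Hypothetical blinded prompt: states only the treatment (the price).

    Everything else (past price, competing price, ...) is left unspecified,
    so the model is free to infer it from the price it sees.
    """
    return (
        f"You are shopping in a store and see {product} priced at ${price:.2f}. "
        "Would you buy it? Answer yes or no."
    )


def unblinded_prompt(product: str, price: float, price_grid: Sequence[float]) -> str:
    """Hypothetical unblinded prompt: also reveals that the price was randomized."""
    grid = ", ".join(f"${p:.2f}" for p in price_grid)
    return (
        "You are a participant in a pricing experiment. "
        f"The price of {product} was drawn at random from these values: {grid}. "
        "The draw is independent of the product's quality, competing prices, "
        "and any other factor. "
        f"In this round the price is ${price:.2f}. "
        "Would you buy it? Answer yes or no."
    )


if __name__ == "__main__":
    print(blinded_prompt("Coca-Cola", 1.50))
    print(unblinded_prompt("Coca-Cola", 1.50, [1.00, 1.50, 2.00]))
```

The only difference between the two strings is the statement of the design: the second tells the simulated subject that the price carries no information about anything else, which is the property the abstract refers to as unblinding.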

Paper Structure

This paper contains 17 sections, 1 theorem, 2 equations, 10 figures, and 4 tables.

Key Result

Theorem F.1

[Impossibility Theorem for Ambiguous Prompting Strategies] For a given ambiguous prompting strategy, there exists no LLM $f$ that can answer all questions correctly.
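
The paper's formal definitions and proof are not reproduced in this summary. One informal way to read the statement is sketched below, under two assumptions that are not taken from the paper: that "ambiguous" means a single prompt is compatible with two questions whose correct answers differ, and that $f$ is a fixed mapping from prompts to answers (or answer distributions). The symbols $s$, $q_1$, $q_2$, $x$, and $a^*$ are notation introduced here for illustration only.

```latex
\begin{align*}
&\text{Let } s \text{ map questions to prompts, and suppose } s(q_1) = s(q_2) = x
  \text{ while } a^*(q_1) \neq a^*(q_2). \\
&\text{Any } f \text{ returns one output } f(x) \text{ for the prompt } x, \\
&\text{so } f(x) \neq a^*(q_1) \ \text{ or } \ f(x) \neq a^*(q_2).
\end{align*}
```

Under this reading, the failure comes from the prompt rather than from the model: no choice of $f$ can answer both $q_1$ and $q_2$ correctly through the same prompt.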

Figures (10)

  • Figure 1: Unintended correlation between price and past price, competing price, and expiration days
  • Figure 2: Demand curve elicited from humans vs. LLM, averaged over 40 products
  • Figure 3: Demand curves elicited by LLM after controlling for demographic variables
  • Figure 4: Purchase probability, controlling for demographics and competing price: Coca-Cola example.
  • Figure 5: Mean absolute error progression as more covariates are controlled in the simulation. The MAE is calculated by comparing the purchase probability for each (product, price) combination.
  • ...and 5 more figures
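
The Figure 5 caption describes the MAE as a comparison of purchase probabilities over (product, price) combinations. A minimal sketch of that calculation follows; the function name, the dictionary layout, and the numbers are made-up assumptions for illustration, not the paper's code or data.

```python
from typing import Dict, Tuple

Cell = Tuple[str, float]  # hypothetical key: (product name, price)


def mean_absolute_error(human: Dict[Cell, float], simulated: Dict[Cell, float]) -> float:
    """Average |human - simulated| purchase probability over shared cells."""
    cells = human.keys() & simulated.keys()
    if not cells:
        raise ValueError("no overlapping (product, price) cells")
    return sum(abs(human[c] - simulated[c]) for c in cells) / len(cells)


if __name__ == "__main__":
    # Made-up purchase probabilities for two price points of one product.
    human = {("cola", 1.0): 0.60, ("cola", 2.0): 0.40}
    simulated = {("cola", 1.0): 0.70, ("cola", 2.0): 0.10}
    print(mean_absolute_error(human, simulated))  # roughly 0.2
```

Averaging over the intersection of keys is a design choice made for this sketch: it avoids a `KeyError` when one source lacks a cell, at the cost of silently ignoring unmatched cells.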

Theorems & Definitions (4)

  • Definition 4.1: Prompting Strategy
  • Definition 4.2: Ambiguous Prompting Strategy
  • Theorem F.1
  • Proof of Theorem F.1