The Challenge of Using LLMs to Simulate Human Behavior: A Causal Inference Perspective

George Gui; Olivier Toubia

TL;DR

This work identifies a fundamental pitfall in using large language models to simulate human counterfactuals: blinded prompts can induce unintended variation in covariates, violating unconfoundedness and biasing causal inferences. Through a demand-estimation study with 40 products, the authors show how price changes covary with other factors that should stay fixed, producing endogeneity and implausible demand curves. They also reveal a trade-off: controlling for covariates reduces confounding but makes those covariates artificially salient (focalism), which compromises ecological validity. They formalize how ambiguous prompting strategies give rise to this confounding and demonstrate that unblinding prompts (explicitly stating the experimental design and randomization) consistently improves predictive accuracy across models and remains beneficial when fine-tuning with human data or mixing in unrelated datasets. The findings advocate for unambiguous prompting as a core design principle in LLM simulations, with practical implications for research evaluation, prompt design, and model development. Overall, the paper offers a causal-inference framework and empirical guidance to enhance the reliability and validity of LLM-based behavioral simulations.

Abstract

Large Language Models (LLMs) have shown impressive potential to simulate human behavior. We identify a fundamental challenge in using them to simulate experiments: when LLM-simulated subjects are blind to the experimental design (as is standard practice with human subjects), variations in treatment systematically affect unspecified variables that should remain constant, violating the unconfoundedness assumption. Using demand estimation as a context and an actual experiment with 40 different products as a benchmark, we show this can lead to implausible results. While confounding may in principle be addressed by controlling for covariates, this can compromise ecological validity in the context of LLM simulations: controlled covariates become artificially salient in the simulated decision process. We show formally that confounding stems from ambiguous prompting strategies. Therefore, it can be addressed by developing unambiguous prompting strategies through unblinding, i.e., revealing the experiment design in LLM simulations. Our empirical results show that this strategy consistently enhances model performance across all tested models, including both out-of-the-box reasoning and non-reasoning models. We also show that it is a technique that complements fine-tuning: while fine-tuning can improve simulation performance, an unambiguous prompting strategy makes the predictions robust to the inclusion of irrelevant data in the fine-tuning process.
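
The confounding mechanism described in the abstract can be illustrated with a small simulation. The sketch below is not the authors' code or data: the linear purchase index, the coefficients (-0.5 on own price, +0.4 on competing price), and the way the competing price tracks the own price are all hypothetical choices made for illustration. It compares the demand slope estimated when an unspecified variable moves with price (the "blind" case) against the slope when that variable is held independent of price (as under explicit randomization).

```python
import random


def ols_slope(x, y):
    """Slope of a simple least-squares regression of y on x."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var = sum((a - mean_x) ** 2 for a in x)
    return cov / var


def estimated_price_slope(n, confounded, seed=0):
    """Regress a hypothetical purchase index on price alone.

    All numbers here are illustrative assumptions, not estimates from the paper.
    """
    rng = random.Random(seed)
    prices, outcomes = [], []
    for _ in range(n):
        price = rng.uniform(1.0, 5.0)
        if confounded:
            # Blind case: the unspecified competing price moves with own price.
            competing = 0.9 * price + rng.gauss(0.0, 0.3)
        else:
            # Randomized case: competing price is unrelated to own price.
            competing = 3.0 + rng.gauss(0.0, 0.3)
        # Hypothetical linear purchase index (true own-price effect is -0.5).
        index = 1.0 - 0.5 * price + 0.4 * competing + rng.gauss(0.0, 0.1)
        prices.append(price)
        outcomes.append(index)
    return ols_slope(prices, outcomes)


blind_slope = estimated_price_slope(20000, confounded=True)
randomized_slope = estimated_price_slope(20000, confounded=False)
print(f"blind: {blind_slope:.3f}, randomized: {randomized_slope:.3f}")
```

Under these assumptions the randomized slope recovers the true effect of about -0.5, while the blind slope is pulled toward zero (about -0.5 + 0.4 × 0.9 = -0.14), which is the kind of flattened, implausible demand curve the abstract attributes to unintended covariate variation.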

Paper Structure (17 sections, 1 theorem, 2 equations, 10 figures, 4 tables)

Introduction
Unintended Confounding in Blind LLM Simulations
Potential Solution: Controlling for Covariates
Core Issue: Ambiguous Prompting Strategy
The value of unblinding
Conclusion
Category, Product and Price Information
Simple prompts for eliciting correlation
Covariance of unspecified variables
Simulation that controls for demographic variables
Simulation that also controls for competing price
Non-monotonic impact of controlling for covariates
Theoretical Framework
Impossibility of Ambiguous Prompting
Illustration of the underlying DGP that LLM simulation mimics if prompts are interpreted differently
...and 2 more sections

Key Result

Theorem F.1

[Impossibility Theorem for Ambiguous Prompting Strategies] For a given ambiguous prompting strategy, there exists no LLM $f$ that can answer all questions correctly.
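
The intuition behind the theorem can be sketched with a toy example. The code below rests on an assumed reading of "ambiguous prompting strategy" (two questions with different correct answers are rendered as the same prompt); this page does not reproduce the paper's formal Definitions 4.1 and 4.2, so this reading, the example prompts, and the helper `exists_correct_model` are all hypothetical. The sketch also restricts attention to deterministic models over a two-answer space.

```python
from itertools import product


def exists_correct_model(questions, answer_space):
    """Return True if some deterministic map prompt -> answer gets every question right."""
    prompts = sorted({q["prompt"] for q in questions})
    # Enumerate every possible deterministic model over the observed prompts.
    for answers in product(answer_space, repeat=len(prompts)):
        model = dict(zip(prompts, answers))
        if all(model[q["prompt"]] == q["correct"] for q in questions):
            return True
    return False


# Ambiguous strategy (assumed reading): two different questions share one prompt.
ambiguous = [
    # Question 1: only the price changed, everything else is fixed.
    {"prompt": "Would you buy this soda at $3?", "correct": "yes"},
    # Question 2: other factors moved together with the price.
    {"prompt": "Would you buy this soda at $3?", "correct": "no"},
]

# Unambiguous strategy: each question gets its own prompt.
unambiguous = [
    {"prompt": "The price was randomly set to $3; nothing else changed. Buy?",
     "correct": "yes"},
    {"prompt": "The store raised the price to $3 along with rivals. Buy?",
     "correct": "no"},
]

ambiguous_solvable = exists_correct_model(ambiguous, ["yes", "no"])
unambiguous_solvable = exists_correct_model(unambiguous, ["yes", "no"])
print(ambiguous_solvable, unambiguous_solvable)
```

Because the ambiguous strategy sends both questions through one prompt, any single answer is wrong for at least one of them; once the prompts differ, a correct model exists.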

Figures (10)

Figure 1: Unintended correlation between price and past price, competing price, and expiration days
Figure 2: Demand curve elicited from humans vs. LLM, averaged over 40 products
Figure 3: Demand curves elicited by LLM after controlling for demographic variables
Figure 4: Purchase probability, controlling for demographics and competing price: Coca-Cola example.
Figure 5: Mean absolute error progression as more covariates are controlled in the simulation. The MAE is calculated by comparing the purchase probability for each (product, price) combination.
...and 5 more figures
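
The Figure 5 caption describes the error metric as a mean absolute error over purchase probabilities for each (product, price) combination. A minimal sketch of that computation follows; the function name `mean_absolute_error`, the dictionary layout, and the numbers are hypothetical placeholders, not data from the paper.

```python
def mean_absolute_error(human, simulated):
    """MAE over the (product, price) cells shared by both sources."""
    shared_cells = sorted(set(human) & set(simulated))
    if not shared_cells:
        raise ValueError("no overlapping (product, price) cells")
    total = sum(abs(human[cell] - simulated[cell]) for cell in shared_cells)
    return total / len(shared_cells)


# Hypothetical purchase probabilities keyed by (product, price).
human_probs = {("soda", 1.0): 0.8, ("soda", 2.0): 0.5}
llm_probs = {("soda", 1.0): 0.7, ("soda", 2.0): 0.7}

mae = mean_absolute_error(human_probs, llm_probs)
print(f"MAE: {mae:.3f}")
```

Averaging over cells rather than over individual responses gives each (product, price) combination equal weight, which is one reasonable reading of the caption.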

Theorems & Definitions (4)

Definition 4.1: Prompting Strategy
Definition 4.2: Ambiguous Prompting Strategy
Theorem F.1
proof
