Predicting Effects, Missing Distributions: Evaluating LLMs as Human Behavior Simulators in Operations Management

Runze Zhang, Xiaowei Zhang, Mingyang Zhao

TL;DR

This study evaluates whether large language models (LLMs) can serve as faithful simulators of human behavior in operations management (OM) by comparing LLM outputs against human data from nine published behavioral OM experiments. The authors apply two criteria: replication of hypothesis-test outcomes, and distributional alignment measured by the Wasserstein distance $W$. They find that LLMs reproduce most behavioral effects but that their response distributions diverge from the human ones. Two lightweight strategies, Chain-of-Thought prompting and hyperparameter tuning (especially temperature and sampling parameters), reduce $W$ and, in some cases, allow smaller or open-source models to match or surpass larger systems. The work provides a framework for evaluating LLMs as behavioral surrogates in OM and highlights the need for careful tuning and benchmarking when distributional fidelity matters for policy evaluation and system design.

Abstract

LLMs are emerging tools for simulating human behavior in business, economics, and social science, offering a lower-cost complement to laboratory experiments, field studies, and surveys. This paper evaluates how well LLMs replicate human behavior in operations management. Using nine published experiments in behavioral operations, we assess two criteria: replication of hypothesis-test outcomes and distributional alignment via Wasserstein distance. LLMs reproduce most hypothesis-level effects, capturing key decision biases, but their response distributions diverge from human data, including for strong commercial models. We also test two lightweight interventions -- chain-of-thought prompting and hyperparameter tuning -- which reduce misalignment and can sometimes let smaller or open-source models match or surpass larger systems.
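To make the second criterion concrete, the sketch below computes a Wasserstein distance between two one-dimensional samples of responses. It is a minimal illustration, not the authors' code: it assumes the order-1 distance on scalar responses (the abstract does not state the order), the function name `wasserstein_1d` is hypothetical, and the two response lists are made-up numbers chosen only to show that samples with the same mean can still be far apart in distribution.

```python
from bisect import bisect_right


def wasserstein_1d(u, v):
    """Order-1 Wasserstein distance between two empirical 1-D samples.

    Uses the identity W1 = integral of |F_u(x) - F_v(x)| dx, where F_u and
    F_v are the empirical CDFs of the two samples.
    """
    if not u or not v:
        raise ValueError("both samples must be non-empty")
    u_sorted, v_sorted = sorted(u), sorted(v)
    # Both CDFs are step functions that only change at observed values.
    grid = sorted(set(u_sorted) | set(v_sorted))
    total = 0.0
    for left, right in zip(grid, grid[1:]):
        cdf_u = bisect_right(u_sorted, left) / len(u_sorted)
        cdf_v = bisect_right(v_sorted, left) / len(v_sorted)
        total += abs(cdf_u - cdf_v) * (right - left)
    return total


# Hypothetical responses (e.g. order quantities); NOT data from the paper.
human = [20, 40, 60, 80, 100]   # widely dispersed answers
llm = [55, 58, 60, 62, 65]      # same mean, much narrower spread

print("mean(human) =", sum(human) / len(human))
print("mean(llm)   =", sum(llm) / len(llm))
print("W(human, llm) =", wasserstein_1d(human, llm))
```

In this toy case a test on the mean would see no difference between the two samples, while the distance is clearly positive, which mirrors the gap the abstract describes between hypothesis-level replication and distributional alignment.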

Paper Structure

This paper contains 37 sections, 3 equations, 22 figures, 7 tables.

Figures (22)

  • Figure 1: Prompt to Simulate Human Responses in the Experiment of doi:10.1287/mnsc.1120.1638
  • Figure 2: Uninformed LLM Consumers’ Purchase Rates by Waiting Time
  • Figure 3: Wasserstein Distance with Different LLMs
  • Figure 4: Distributions of LLM and Human Responses in the Experiment of doi:10.1287/mnsc.46.3.404.12070
  • Figure 5: Chain-of-Thought Improvement with Llama‑70B
  • ...and 17 more figures