
Emulating Aggregate Human Choice Behavior and Biases with GPT Conversational Agents

Stephen Pilli, Vivek Nallur

TL;DR

This work investigates whether large language models can emulate biased human decision-making at the individual level within conversational contexts. Using a human experiment (N=1100) across three classic decision scenarios and two prior-dialogue complexities, the study benchmarks status quo bias in chatbot-assisted decisions. It then tests whether LLMs (GPT-4 and GPT-5), supplied with demographic cues and prior dialogue transcripts, can predict human biases. Results show robust status quo effects in humans for Budget Allocation and College Jobs, modest amplification under cognitive load, and mixed replication by LLM agents depending on prompting; HL3 prompts yield stronger but sometimes misleading bias alignment. The findings demonstrate both the potential and the limitations of LLM-based behavioral simulations for bias-aware interactive systems, and they emphasize the need for careful prompt design and validation when modeling individual-level decision dynamics.

Abstract

Cognitive biases often shape human decisions. While large language models (LLMs) have been shown to reproduce well-known biases, a more critical question is whether LLMs can predict biases at the individual level and emulate the dynamics of biased human behavior when contextual factors, such as cognitive load, interact with these biases. We adapted three well-established decision scenarios into a conversational setting and conducted a human experiment (N=1100). Participants engaged with a chatbot that facilitates decision-making through simple or complex dialogues. Results revealed robust biases. To evaluate how LLMs emulate human decision-making under similar interactive conditions, we used participant demographics and dialogue transcripts to simulate these conditions with LLMs based on GPT-4 and GPT-5. The LLMs reproduced human biases with precision. We found notable differences between models in how they aligned with human behavior. This has important implications for designing and evaluating adaptive, bias-aware LLM-based AI systems in interactive contexts.

Paper Structure (85 sections, 2 equations, 5 figures, 21 tables)

Figures (5)

  • Figure 1: Overview of the experimental procedure and design. Task abbreviations: IDM — Investment Decision-Making, BA — Budget Allocation, CJ — College Jobs. IV1 and IV2 denote independent variables; DV indicates the dependent variable.
  • Figure 2: NASA-TLX scores show significantly higher perceived Mental demand and Effort under the Complex Dialogue condition, confirming the effectiveness of the cognitive load manipulation.
  • Figure 3: Scatter-plots with regression lines showing associations between Mental Demand and Memory Task Accuracy (left), Response Time and Memory Task Accuracy (center), and Response Time and Mental Demand (right); the first two are shown under the Load condition. Shaded bands represent 95% confidence intervals.
  • Figure 4: Status quo: Correlation Between Response Length (Chars) and Response Times (s). The bold line is the regression line. The dotted line is the average human typing speed (260 characters per minute).
  • Figure 5: Forest plot of effect sizes and 95% confidence intervals for all models compared with the human experiment.
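
The dotted reference line described in the Figure 4 caption can be read as a simple unit conversion from response length to expected typing time. The following is a minimal sketch of that conversion, assuming only the 260 characters-per-minute figure stated in the caption; the function name and constant are hypothetical and are not taken from the paper.

```python
# Average human typing speed as stated in the Figure 4 caption.
TYPING_SPEED_CPM = 260.0  # characters per minute


def expected_typing_seconds(n_chars: int, cpm: float = TYPING_SPEED_CPM) -> float:
    """Return the time in seconds needed to type n_chars at cpm characters per minute.

    Hypothetical helper: it reproduces the kind of reference line the caption
    describes (response length in characters against response time in seconds).
    """
    if n_chars < 0:
        raise ValueError("n_chars must be non-negative")
    if cpm <= 0:
        raise ValueError("cpm must be positive")
    # Characters divided by characters-per-minute gives minutes; scale to seconds.
    return n_chars / cpm * 60.0


if __name__ == "__main__":
    for length in (65, 130, 260):
        print(length, expected_typing_seconds(length))
```

Under this reading, a response that arrives much faster than the reference time for its length was probably not typed at average human speed, which may be the comparison the figure is meant to support.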