Evaluating the Bias in LLMs for Surveying Opinion and Decision Making in Healthcare

Yonchanok Khaokaew, Flora D. Salim, Andreas Züfle, Hao Xue, Taylor Anderson, C. Raina MacIntyre, Matthew Scotch, David J Heslop

TL;DR

The paper asks whether LLM-generated simulacra can faithfully model real healthcare decision-making, and what demographic biases they exhibit. It compares vaccination decisions produced under demographic-context prompts with UAS survey data across four pandemic phases, using four open-source models and two bias metrics. The method is a four-context, zero-shot prompting framework, evaluated with DIR and JSD to quantify disparities and distributional differences. Alignment is heterogeneous across models: some track early uptake while exaggerating or dampening later skepticism, and several obscure real-world demographic variation. The work highlights both the potential of generative agents for health behaviour research and the need for bias-aware prompting and model selection to avoid misrepresenting real-world patterns.
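
As a rough illustration of the two metrics named above, the sketch below computes a Jensen-Shannon divergence between a survey answer distribution and a simulated one, and a ratio of acceptance rates between two demographic groups. It assumes that JSD stands for Jensen-Shannon divergence and that DIR is a disparate-impact-style ratio of group rates; the summary does not spell out either definition, so the paper's exact formulas may differ. All function names and the toy numbers are hypothetical and are not taken from the paper or the UAS data.

```python
import math
from collections import Counter


def distribution(decisions, categories):
    """Turn a list of categorical decisions into a probability vector
    ordered by `categories` (hypothetical helper)."""
    if not decisions:
        raise ValueError("decisions must not be empty")
    counts = Counter(decisions)
    return [counts[c] / len(decisions) for c in categories]


def jensen_shannon_divergence(p, q):
    """Jensen-Shannon divergence with base-2 logarithms.
    0 means identical distributions, 1 means disjoint support."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(a, b):
        # Terms with a_i == 0 contribute nothing to the KL sum.
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def disparate_impact_ratio(rate_group, rate_reference):
    """Ratio of a group's acceptance rate to a reference group's rate
    (assumed reading of DIR). 1.0 means parity."""
    if rate_reference == 0:
        raise ValueError("reference rate must be non-zero")
    return rate_group / rate_reference


if __name__ == "__main__":
    categories = ["yes", "no"]
    # Toy data only, invented for illustration.
    survey = distribution(["yes", "yes", "no", "yes"], categories)
    simulated = distribution(["yes", "yes", "yes", "yes"], categories)
    print("JSD:", jensen_shannon_divergence(survey, simulated))
    print("DIR:", disparate_impact_ratio(0.6, 0.8))
```

Under this reading, a model that predicts universal acceptance yields a non-zero JSD against any survey distribution containing refusals, and a DIR of 1.0 across all groups even where the survey shows group differences.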

Abstract

Generative agents have been increasingly used to simulate human behaviour in silico, driven by large language models (LLMs). These simulacra serve as sandboxes for studying human behaviour without compromising privacy or safety. However, it remains unclear whether such agents can truly represent real individuals. This work compares survey data from the Understanding America Study (UAS) on healthcare decision-making with simulated responses from generative agents. Using demographic-based prompt engineering, we create digital twins of survey respondents and analyse how well different LLMs reproduce real-world behaviours. Our findings show that some LLMs fail to reflect realistic decision-making, for example by predicting universal vaccine acceptance. Llama 3 captures variations across race and income more accurately, but also introduces biases not present in the UAS data. This study highlights the potential of generative agents for behavioural research while underscoring the risks of bias from both LLMs and prompting strategies.

Paper Structure

This paper contains 15 sections, 3 figures, 1 table.

Figures (3)

  • Figure 1: Overview of the experimental setup
  • Figure 2: Comparison of survey and LLM decision outputs across four different situations
  • Figure 3: Racial Bias in Vaccine Decisions: LLM Outputs vs. Survey Data