Do LLMs exhibit human-like response biases? A case study in survey design

Lindia Tjuatja, Valerie Chen, Sherry Tongshuang Wu, Ameet Talwalkar, Graham Neubig

TL;DR

This work addresses whether LLMs can serve as faithful proxies for human response biases in survey design. It develops a bias-aware evaluation framework using original and modified questions derived from the American Trends Panel, evaluating 9 models across 5 biases and 3 non-bias perturbations, with 50 samples per condition. The study finds that LLMs generally fail to reproduce human bias patterns; RLHF-trained models show reduced bias sensitivity but increased sensitivity to perturbations, and replicating human opinion distributions does not guarantee human-like bias behavior. The results urge caution in using LLMs as human surrogates for surveys and advocate for more nuanced, multi-metric evaluations to characterize model behavior.

Abstract

As large language models (LLMs) become more capable, there is growing excitement about the possibility of using LLMs as proxies for humans in real-world tasks where subjective labels are desired, such as in surveys and opinion polling. One widely cited barrier to the adoption of LLMs as proxies for humans in subjective tasks is their sensitivity to prompt wording, but interestingly, humans also display sensitivities to instruction changes in the form of response biases. We investigate the extent to which LLMs reflect human response biases, if at all. We look to survey design, where human response biases caused by changes in the wordings of "prompts" have been extensively explored in social psychology literature. Drawing from these works, we design a dataset and framework to evaluate whether LLMs exhibit human-like response biases in survey questionnaires. Our comprehensive evaluation of nine models shows that popular open and commercial LLMs generally fail to reflect human-like behavior, particularly models that have undergone RLHF. Furthermore, even when a model shows a significant change in the same direction as humans, we find that it is sensitive to perturbations that do not elicit significant changes in humans. These results highlight the pitfalls of using LLMs as human proxies, and underscore the need for finer-grained characterizations of model behavior. Our code, dataset, and collected samples are available at https://github.com/lindiatjuatja/BiasMonkey

Paper Structure

This paper contains 19 sections, 1 equation, 3 figures, 11 tables.

Figures (3)

  • Figure 1: Our evaluation framework consists of three steps: (1) generating a dataset of original and modified questions given a response bias of interest, (2) collecting LLM responses, and (3) evaluating whether the change in the distribution of LLM responses aligns with known trends about human behavior. We directly apply the same workflow to evaluate LLM behavior on non-bias perturbations (i.e., question modifications that have been shown to not elicit a change in response in humans).
  • Figure 2: We compare LLMs' behavior on bias types ($\bar{\Delta}_{\text{b}}$) with their respective behavior on the set of perturbations ($\bar{\Delta}_{\text{p}}$). We color cells that have statistically significant changes by the directionality of $\bar{\Delta}_{\text{b}}$ (blue indicates a positive effect and orange indicates a negative effect), using a $p=0.05$ cut-off, and use hatched cells to indicate non-significant changes. The full results table in the paper lists the $\bar{\Delta}_{\text{b}}$ and $\bar{\Delta}_{\text{p}}$ values and p-values. Ideally, models would respond only to the bias modifications and not to the other perturbations, as in the top-right "most human-like" depiction; the results generally do not reflect this ideal setting.
  • Figure 3: Representativeness is a metric based on the Wasserstein distance that measures the extent to which each model reflects the opinions of a population, in this case Pew U.S. survey respondents (higher is better) (Santurkar et al., 2023). Colors indicate model groupings, with red for the Llama2 base models, green for Solar (instruction fine-tuned Llama2 70b), blue for Llama2 chat models, and purple for GPT 3.5.
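
The two quantities named in the figure captions can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`mean_change`, `wasserstein_ordinal`, `representativeness`) are hypothetical, the "target" option and sign convention for the change in responses are assumed (the paper defines them per bias type), answer options are assumed to sit at evenly spaced ordinal positions, and the normalization by the number of options minus one is assumed from the usual form of the representativeness metric. No significance test is included.

```python
from itertools import accumulate


def mean_change(pairs, target):
    """Average change in the share of `target` answers across question pairs.

    pairs: list of (original_responses, modified_responses), each a list of
    sampled answer labels. Sign convention (modified minus original) is an
    assumption made for this sketch.
    """
    deltas = [
        modified.count(target) / len(modified)
        - original.count(target) / len(original)
        for original, modified in pairs
    ]
    return sum(deltas) / len(deltas)


def wasserstein_ordinal(p, q):
    """1-Wasserstein distance between two distributions over ordinal options.

    Options are assumed to sit at positions 0, 1, ..., N-1, so the distance
    is the summed absolute difference between the two cumulative
    distributions (the last CDF entry is 1 for both and is skipped).
    """
    cdf_p = list(accumulate(p))
    cdf_q = list(accumulate(q))
    return sum(abs(a - b) for a, b in zip(cdf_p[:-1], cdf_q[:-1]))


def representativeness(p_model, p_human):
    """Similarity score in [0, 1]; 1 means identical distributions.

    Assumed form: 1 - WD / (N - 1), where N is the number of options.
    """
    n_options = len(p_model)
    return 1.0 - wasserstein_ordinal(p_model, p_human) / (n_options - 1)


if __name__ == "__main__":
    # 50 samples per condition, matching the sample size stated in the TL;DR.
    original = ["A"] * 25 + ["B"] * 25
    modified = ["A"] * 35 + ["B"] * 15
    print(mean_change([(original, modified)], target="A"))
    print(representativeness([0.2, 0.5, 0.3], [0.3, 0.4, 0.3]))
```

A model that tracked human behavior would show a mean change of the expected sign on bias modifications and a mean change near zero on non-bias perturbations; representativeness is a separate measure, which is why a high score on it need not imply human-like bias behavior.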