Do LLMs exhibit human-like response biases? A case study in survey design
Lindia Tjuatja, Valerie Chen, Sherry Tongshuang Wu, Ameet Talwalkar, Graham Neubig
TL;DR
This work asks whether LLMs reproduce human response biases in survey design, and therefore whether they can serve as faithful proxies for human respondents. It develops a bias-aware evaluation framework using original and modified questions derived from the American Trends Panel, evaluating 9 models across 5 biases and 3 non-bias perturbations, with 50 samples per condition. The study finds that LLMs generally fail to reproduce human bias patterns; RLHF-trained models show reduced bias sensitivity but increased sensitivity to perturbations, and replicating human opinion distributions does not guarantee human-like bias behavior. The results urge caution in using LLMs as human surrogates for surveys and advocate for more nuanced, multi-metric evaluations to characterize model behavior.
Abstract
As large language models (LLMs) become more capable, there is growing excitement about the possibility of using LLMs as proxies for humans in real-world tasks where subjective labels are desired, such as in surveys and opinion polling. One widely cited barrier to the adoption of LLMs as proxies for humans in subjective tasks is their sensitivity to prompt wording. Interestingly, however, humans also display sensitivities to instruction changes in the form of response biases. We investigate the extent to which LLMs reflect human response biases, if at all. We look to survey design, where human response biases caused by changes in the wording of "prompts" have been extensively explored in social psychology literature. Drawing from these works, we design a dataset and framework to evaluate whether LLMs exhibit human-like response biases in survey questionnaires. Our comprehensive evaluation of nine models shows that popular open and commercial LLMs generally fail to reflect human-like behavior, particularly those that have undergone RLHF. Furthermore, even if a model shows a significant change in the same direction as humans, we find that it is sensitive to perturbations that do not elicit significant changes in humans. These results highlight the pitfalls of using LLMs as human proxies, and underscore the need for finer-grained characterizations of model behavior. Our code, dataset, and collected samples are available at https://github.com/lindiatjuatja/BiasMonkey
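
To make the comparison between original and modified questions concrete, the following is a minimal sketch of how a change in responses could be measured. It is an illustration under stated assumptions, not the released implementation: the function names `target_share` and `bias_shift`, the option labels, the response counts, and the human shift value are all hypothetical, and the significance testing described in the abstract is omitted.

```python
from collections import Counter


def target_share(responses, target_options):
    """Fraction of sampled responses that fall in the target option set."""
    counts = Counter(responses)
    return sum(counts[option] for option in target_options) / len(responses)


def bias_shift(original, modified, target_options):
    """Change in the share of target options from the original to the modified question."""
    return target_share(modified, target_options) - target_share(original, target_options)


# Hypothetical data: 50 sampled responses per condition for one question pair.
original_responses = ["A"] * 20 + ["B"] * 30
modified_responses = ["A"] * 35 + ["B"] * 15

# Assumption: option "A" is the one the bias is expected to favor.
shift = bias_shift(original_responses, modified_responses, {"A"})

# Hypothetical human shift for the same question pair, used only to compare direction.
human_shift = 0.12
same_direction = shift * human_shift > 0

print(round(shift, 2), same_direction)
```

A positive shift under a bias modification, together with a shift near zero under a non-bias perturbation, would be the human-like pattern; a full analysis would aggregate over many question pairs and test for significance.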
