Prompt Perturbations Reveal Human-Like Biases in Large Language Model Survey Responses
Jens Rupprecht, Georg Ahnert, Markus Strohmaier
TL;DR
The paper examines how reliable large language models are as proxies for human survey respondents. It systematically perturbs both the questions and the answer options of World Values Survey items and evaluates nine LLMs across 167,400 simulated interviews. Using metrics such as KL divergence ($D_{KL}$) and entropy, it shows that larger models are generally more robust but remain vulnerable to semantic perturbations and to interactions between multiple perturbations, with a pervasive recency bias toward the last presented option. The study also reveals human-like biases such as recency effects, context-dependent central tendency, and variable priming responses across models, highlighting the importance of prompt design and robustness checks when generating synthetic survey data. In practice, the results inform recommendations for model selection, scale design, and prompt strategies that improve the reliability and interpretability of synthetic survey outputs, and the paper provides a perturbation framework and dataset for ongoing benchmarking.
Abstract
Large Language Models (LLMs) are increasingly used as proxies for human subjects in social science surveys, but their reliability and their susceptibility to known human-like response biases, such as central tendency, opinion floating, and primacy bias, are poorly understood. This work investigates the response robustness of LLMs in normative survey contexts. We test nine LLMs on questions from the World Values Survey (WVS), applying a comprehensive set of ten perturbations to both question phrasing and answer option structure, resulting in over 167,000 simulated survey interviews. In doing so, we not only reveal LLMs' vulnerabilities to perturbations but also show that all tested models exhibit a consistent recency bias, disproportionately favoring the last-presented answer option. While larger models are generally more robust, all models remain sensitive to semantic variations such as paraphrasing and to combined perturbations. This underscores the critical importance of prompt design and robustness testing when using LLMs to generate synthetic survey data.
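
As a rough illustration of the $D_{KL}$ and entropy metrics named in the TL;DR, the sketch below compares the answer distribution of a perturbed prompt against an unperturbed baseline. The toy response lists, the add-one smoothing, the direction of the divergence (perturbed relative to baseline), and all function names are assumptions made for this example; the paper's exact definitions are not given in this section.

```python
import math
from collections import Counter

def distribution(responses, options):
    """Empirical probability of each option, with add-one smoothing
    (an assumption here, so that no probability is exactly zero)."""
    counts = Counter(responses)
    total = len(responses) + len(options)
    return [(counts[o] + 1) / total for o in options]

def entropy(p):
    """Shannon entropy in bits."""
    return -sum(x * math.log2(x) for x in p if x > 0)

def kl_divergence(p, q):
    """D_KL(p || q) in bits; q must be strictly positive wherever p is."""
    return sum(x * math.log2(x / y) for x, y in zip(p, q) if x > 0)

# Hypothetical answers on a 4-point scale (not data from the paper).
options = [1, 2, 3, 4]
baseline_responses = [1, 2, 2, 3, 2, 3, 4, 2]
perturbed_responses = [4, 4, 3, 4, 2, 4, 4, 3]

q = distribution(baseline_responses, options)
p = distribution(perturbed_responses, options)

print("entropy (baseline): ", entropy(q))
print("entropy (perturbed):", entropy(p))
print("D_KL(perturbed || baseline):", kl_divergence(p, q))
print("share of last option (perturbed):", p[-1])
```

In this toy setup, a larger divergence would indicate that the perturbation shifted the answer distribution more, and a high share for the last option would be one simple sign of recency bias.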
