Prompt Perturbations Reveal Human-Like Biases in Large Language Model Survey Responses

Jens Rupprecht, Georg Ahnert, Markus Strohmaier

TL;DR

The paper addresses the reliability of large language models as proxies for human survey respondents by systematically perturbing both questions and answer options in World Values Survey items and evaluating nine LLMs across 167,400 interviews. Using metrics such as KL divergence ($D_{KL}$) and entropy, it demonstrates that larger models are generally more robust but remain vulnerable to semantic perturbations and multi-perturbation interactions, with a pervasive recency bias toward the last presented option. The study also reveals human-like biases such as recency effects, context-dependent central tendency, and variable priming responses across models, highlighting the importance of prompt design and robustness checks when generating synthetic survey data. Practically, the results inform recommended practices for model selection, scale design, and prompt strategies to improve the reliability and interpretability of synthetic survey outputs, and the paper provides a perturbation framework and dataset for ongoing benchmarking.

Abstract

Large Language Models (LLMs) are increasingly used as proxies for human subjects in social science surveys, but their reliability and susceptibility to known human-like response biases, such as central tendency, opinion floating, and primacy bias, are poorly understood. This work investigates the response robustness of LLMs in normative survey contexts. We test nine LLMs on questions from the World Values Survey (WVS), applying a comprehensive set of ten perturbations to both question phrasing and answer option structure, resulting in over 167,000 simulated survey interviews. In doing so, we not only reveal LLMs' vulnerabilities to perturbations but also show that all tested models exhibit a consistent recency bias, disproportionately favoring the last-presented answer option. While larger models are generally more robust, all models remain sensitive to semantic variations like paraphrasing and to combined perturbations. This underscores the critical importance of prompt design and robustness testing when using LLMs to generate synthetic survey data.

Paper Structure

This paper contains 40 sections, 11 figures, 2 tables.

Figures (11)

  • Figure 1: The Interview Process. The figure displays an example of an answer option perturbation (a bias perturbation, e.g. reversed option order) and a question perturbation (a non-bias perturbation, e.g. typos in the question). Each model is prompted 25 times with every perturbation as well as with the original Q&A phrasing. All responses are collected, processed, and statistically analyzed.
  • Figure 2: Effect of Prompt Perturbations on Response Robustness. Each cell represents the share of fully robust responses (KL divergence = 0) by model and perturbation type across all 62 questions. Larger models (Llama-70B, Gemini) are substantially more robust than the smallest models (Llama-1B, Llama-3B). Specific perturbations, such as odd versus even scales and reversed answer options, are more challenging for all models than a missing refusal category or an additional priming suffix. The response robustness to the Non-Bias Perturbations is reported in the figure labeled fig:kl_share_nonbias_divergence_heatmaps.
  • Figure 3: Evidence of recency bias across all models. The bars show the frequency of choosing the same answer option (e.g., "Very important") when it is presented first vs. last. All models are significantly more likely to select an option when it appears at the end of the list.
  • Figure 4: Share of fully robust responses (KL divergence = 0) by model and perturbation type. Larger models (Llama-70B, Gemini) are substantially more robust than the smallest models (Llama-1B, Llama-3B).
  • Figure 5: The values show the difference in the mean distance to the scale center for the perturbed responses: (a) without a refusal category and (b) with a middle category. For originally even scales, an artificial middle category is created, and vice versa, so that even and odd scales can be compared for every question. Thus, in an original 5-point Likert scale the middle category is removed, whereas in a 4-point Likert scale a middle category is added. The difference in the shift of the mean response toward the center is consistent across all LLMs. Entries with no change are omitted for better readability.
  • ...and 6 more figures
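
The captions for Figures 2 and 4 report the share of fully robust responses, meaning the share of cases where the KL divergence between original and perturbed responses is 0. The following is a minimal sketch of how such a share could be computed. It is not the authors' implementation: the function names, the smoothing constant `eps`, the direction of the divergence (perturbed relative to original), and the example answer options are all assumptions made for illustration.

```python
import math
from collections import Counter


def response_distribution(responses, options, eps=1e-9):
    """Smoothed relative frequency of each answer option.

    eps is an assumed smoothing constant that keeps every probability
    positive so the logarithm in the KL divergence is always defined.
    """
    counts = Counter(responses)
    total = len(responses) + eps * len(options)
    return [(counts[option] + eps) / total for option in options]


def kl_divergence(p, q):
    """D_KL(P || Q) for two distributions over the same ordered options."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def robust_share(kl_values, tol=1e-12):
    """Fraction of KL values that are zero up to a small tolerance."""
    return sum(1 for value in kl_values if abs(value) <= tol) / len(kl_values)


# Hypothetical answer options and 25 repeated responses per condition.
options = ["Very important", "Rather important",
           "Not very important", "Not at all important"]
original = ["Very important"] * 25
unchanged = ["Very important"] * 25
shifted = ["Very important"] * 20 + ["Not at all important"] * 5

p_original = response_distribution(original, options)
kl_unchanged = kl_divergence(response_distribution(unchanged, options), p_original)
kl_shifted = kl_divergence(response_distribution(shifted, options), p_original)

print(kl_unchanged)                              # 0.0: identical distributions
print(kl_shifted > 0)                            # True: responses moved
print(robust_share([kl_unchanged, kl_shifted]))  # 0.5
```

Smoothing both distributions in the same way means that identical response sets still yield a divergence of exactly 0, so the "fully robust" criterion is unaffected by the smoothing.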