Questioning the Survey Responses of Large Language Models
Ricardo Dominguez-Olmedo, Moritz Hardt, Celestine Mendler-Dünner
TL;DR
This paper questions whether multiple-choice surveys are a valid way to infer which demographics or values large language models represent, a practice that compares models' survey answer distributions with those of human reference populations such as U.S. census data. It uses randomized answer ordering to separate ordering and labeling biases from genuine preferences, and analyzes 43 models on 25 ACS questions using entropy and KL divergence relative to census distributions. The key finding is that, after this adjustment, LLM responses are close to uniformly random and resemble human populations poorly; apparent alignment with a subgroup is driven largely by how close to uniform that subgroup's responses are (its entropy), not by model size or pre-training data. This casts doubt on survey-derived alignment measures and suggests that other evaluation approaches are needed, a conclusion that holds on additional surveys such as ATP and GAS/WVS.
Abstract
Surveys have recently gained popularity as a tool to study large language models. By comparing survey responses of models to those of human reference populations, researchers aim to infer the demographics, political opinions, or values best represented by current language models. In this work, we critically examine this methodology on the basis of the well-established American Community Survey by the U.S. Census Bureau. Evaluating 43 different language models using de facto standard prompting methodologies, we establish two dominant patterns. First, models' responses are governed by ordering and labeling biases, for example, towards survey responses labeled with the letter "A". Second, when adjusting for these systematic biases through randomized answer ordering, models across the board trend towards uniformly random survey responses, irrespective of model size or pre-training data. As a result, in contrast to conjectures from prior work, survey-derived alignment measures often permit a simple explanation: models consistently appear to better represent subgroups whose aggregate statistics are closest to uniform for any survey under consideration.
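
To make the adjustment concrete, the following is a minimal Python sketch, not the authors' implementation. It averages a model's answer distribution over random answer orderings, then measures how close the result is to uniform (normalized entropy) and how far it is from a reference distribution (KL divergence). The names `adjusted_distribution` and `a_biased_model`, the toy "model", and the reference numbers are hypothetical and chosen only for illustration; a real experiment would query a language model for the probability of each answer label.

```python
import math
import random


def normalized_entropy(p):
    """Shannon entropy of p divided by log(k); 1.0 means exactly uniform."""
    h = -sum(x * math.log(x) for x in p if x > 0)
    return h / math.log(len(p))


def kl_divergence(p, q):
    """KL(p || q) in nats; assumes q > 0 wherever p > 0."""
    return sum(x * math.log(x / y) for x, y in zip(p, q) if x > 0)


def adjusted_distribution(model_fn, choices, n_orderings=1000, seed=0):
    """Average the model's answer distribution over random choice orderings.

    model_fn(ordered_choices) is a hypothetical callable that returns one
    probability per position (i.e. per label A, B, C, ...).
    """
    rng = random.Random(seed)
    totals = {c: 0.0 for c in choices}
    for _ in range(n_orderings):
        order = list(choices)
        rng.shuffle(order)
        # Credit each position's probability to the choice shown there.
        for choice, prob in zip(order, model_fn(order)):
            totals[choice] += prob
    return [totals[c] / n_orderings for c in choices]


def a_biased_model(ordered_choices):
    """Toy stand-in for an LLM: prefers label "A" regardless of content."""
    k = len(ordered_choices)
    return [0.7] + [0.3 / (k - 1)] * (k - 1)


choices = ["option 1", "option 2", "option 3"]
reference = [0.6, 0.3, 0.1]  # made-up reference distribution, not census data

raw = a_biased_model(choices)  # fixed ordering: dominated by the "A" bias
adjusted = adjusted_distribution(a_biased_model, choices)

print("raw      entropy:", round(normalized_entropy(raw), 3))
print("adjusted entropy:", round(normalized_entropy(adjusted), 3))
print("adjusted KL to reference:", round(kl_divergence(adjusted, reference), 3))
```

In this toy setting the model has no preference over answer content, so averaging over orderings removes the label bias and leaves a nearly uniform distribution. That is the pattern the abstract reports for real models after adjustment.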
