Questioning the Survey Responses of Large Language Models

Ricardo Dominguez-Olmedo, Moritz Hardt, Celestine Mendler-Dünner

TL;DR

This paper questions the validity of using multiple-choice surveys to infer the demographic or value alignment of large language models by comparing models' response distributions to human census data. It uses randomized choice ordering to separate labeling biases from genuine preferences, and analyzes 43 models across 25 ACS questions using entropy and KL divergence relative to census distributions. The key finding is that, even after adjustment, LLM responses remain close to uniform and resemble human populations poorly, with apparent alignment driven largely by the entropy of each subgroup's responses rather than by pre-training data or model size. This casts doubt on survey-based alignment metrics and suggests the need for evaluation approaches beyond census-inspired prompts; the same pattern appears on additional surveys such as ATP and GAS/WVS.

Abstract

Surveys have recently gained popularity as a tool to study large language models. By comparing survey responses of models to those of human reference populations, researchers aim to infer the demographics, political opinions, or values best represented by current language models. In this work, we critically examine this methodology on the basis of the well-established American Community Survey by the U.S. Census Bureau. Evaluating 43 different language models using de-facto standard prompting methodologies, we establish two dominant patterns. First, models' responses are governed by ordering and labeling biases, for example, towards survey responses labeled with the letter "A". Second, when adjusting for these systematic biases through randomized answer ordering, models across the board trend towards uniformly random survey responses, irrespective of model size or pre-training data. As a result, in contrast to conjectures from prior work, survey-derived alignment measures often permit a simple explanation: models consistently appear to better represent subgroups whose aggregate statistics are closest to uniform for any survey under consideration.
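
The adjustment described in the abstract can be illustrated with a minimal sketch. It assumes a hypothetical function that returns a model's probability for each answer position given an ordering of the choices; the toy model below is purely position-driven (it prefers whichever choice is labeled "A") and is not taken from the paper. Averaging over all orderings maps position-level probabilities back to the underlying choices.

```python
from itertools import permutations

def adjusted_distribution(choices, position_probs_fn):
    """Average a model's answer distribution over all orderings of `choices`.

    `position_probs_fn(ordering)` is a hypothetical stand-in for querying a
    model: it returns a list where entry i is the probability that the model
    picks the i-th listed answer (label "A", "B", ...).
    """
    totals = {c: 0.0 for c in choices}
    orderings = list(permutations(choices))
    for ordering in orderings:
        probs = position_probs_fn(ordering)
        # Credit each position's probability to the choice shown there.
        for choice, p in zip(ordering, probs):
            totals[choice] += p
    return {c: t / len(orderings) for c, t in totals.items()}

def toy_a_biased_model(ordering):
    # Toy assumption: 70% on the first label ("A") regardless of content,
    # with the remainder spread evenly over the other positions.
    n = len(ordering)
    return [0.7] + [0.3 / (n - 1)] * (n - 1)

choices = ["Yes", "No", "Unsure"]

# Naive prompting: a single fixed ordering, so "Yes" inherits the A-bias.
naive = dict(zip(choices, toy_a_biased_model(tuple(choices))))

# Adjusted: averaged over all 3! = 6 orderings.
adjusted = adjusted_distribution(choices, toy_a_biased_model)
```

In this toy case the naive distribution puts 0.7 on "Yes" only because it is listed first, while the adjusted distribution assigns 1/3 to every choice: a model with no content preference looks uniform once labeling bias is averaged out.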

Paper Structure (45 sections, 2 equations, 17 figures)

This paper contains 45 sections, 2 equations, 17 figures.

Figures (17)

  • Figure 1: We prompt language models with questions from the American Community Survey (ACS). We systematically compare models' survey responses to those of the U.S. Census.
  • Figure 2: Entropy of model responses across the ACS questions for naive prompting. Entropy of models' responses (◆) tends to increase log-linearly with model size, irrespective of the underlying response entropy observed in the U.S. census (--).
  • Figure 3: A-bias in model responses across ACS questions. Each dot corresponds to one of the 25 questions. Models are ordered by size. As a reference, the extreme points illustrate A-bias for a model that always answers 'A' and a model that never answers 'A'. All models suffer from substantial A-bias.
  • Figure 4: Entropy of model responses after adjustment. (top) Illustration of how adjustment is performed. We average models' responses over all possible answer orderings. (bottom) Entropy of models' responses after adjustment. Entropy of base models' responses is close to 1 (i.e., uniform). Instruction-tuned models exhibit substantially higher variation in entropy across questions.
  • Figure 5: Divergence between adjusted model responses and different baselines: the overall U.S. census (), individual U.S. states (●), and a uniform baseline (★). Smaller means more similar. Model responses are far more similar to the uniform baseline than to any human reference population.
  • ...and 12 more figures
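
The two quantities that recur in these captions, response entropy and divergence from a reference distribution, can be sketched as follows. This is a minimal illustration under stated assumptions: entropy is normalized by the log of the number of answer choices so that 1 corresponds to a uniform distribution (consistent with the Figure 4 caption), divergence is computed as KL(model ‖ reference) (the direction is an assumption, not stated in the captions), and all numbers are made up for illustration rather than taken from the ACS.

```python
import math

def normalized_entropy(p):
    """Shannon entropy of `p` divided by log(k); equals 1 for a uniform p."""
    k = len(p)
    h = -sum(x * math.log(x) for x in p if x > 0)
    return h / math.log(k)

def kl_divergence(p, q):
    """KL(p || q) in nats; assumes q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical three-choice question (illustrative values only).
model = [0.30, 0.36, 0.34]    # adjusted model responses, close to uniform
census = [0.80, 0.15, 0.05]   # a skewed human reference distribution
uniform = [1 / 3, 1 / 3, 1 / 3]

entropy_model = normalized_entropy(model)
div_to_uniform = kl_divergence(model, uniform)
div_to_census = kl_divergence(model, census)
```

With these illustrative values the model's normalized entropy is close to 1 and its divergence to the uniform baseline is much smaller than its divergence to the skewed reference, which mirrors the qualitative pattern the Figure 5 caption describes.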