Benchmarking Distributional Alignment of Large Language Models

Nicole Meister, Carlos Guestrin, Tatsunori Hashimoto

TL;DR

This work constructs a dataset expanding beyond political values, creates human baselines for this task, and evaluates the extent to which an LM can align with a particular group's opinion distribution to inform design choices of such simulation systems.

Abstract

Language models (LMs) are increasingly used as simulacra for people, yet their ability to match the distribution of views of a specific demographic group, that is, to be distributionally aligned, remains uncertain. This notion of distributional alignment is complex, as there is significant variation in the types of attributes that are simulated. Prior works have underexplored the role of three critical variables (the question domain, the steering method, and the distribution expression method), which motivates our contribution of a benchmark explicitly addressing these dimensions. We construct a dataset expanding beyond political values, create human baselines for this task, and evaluate the extent to which an LM can align with a particular group's opinion distribution to inform design choices of such simulation systems. Our analysis reveals open problems regarding whether, and how, LMs can be used to simulate humans, and shows that LMs can describe an opinion distribution more accurately than they can simulate it.

Paper Structure

This paper contains 31 sections, 6 equations, 13 figures, 10 tables.

Figures (13)

  • Figure 1: Our work studies how variations in the dataset (yellow), steering method (green), and distributional expression method (purple) affect the quality of distributional alignment. We rank models and humans in their ability to align with the opinion distribution of demographic groups and find existing metrics for distributional alignment (i.e., model log-probabilities) systematically underestimate LM performance. While LMs may 'know' about distributional alignment, they struggle to sample from their own distribution.
  • Figure 2: Biased Coin Flip: We find that when the probability of heads is measured via model log-probabilities (left), the results are highly uncalibrated (this behavior is mitigated with temperature scaling (TS), shown in green). However, when the distribution is expressed through emitting a 30-token sequence of H or T (Sequence) or directly verbalizing the distributional knowledge (Verbalize Knowledge), we do not observe the same mis-calibration.
  • Figure 3: Top 4 books with the largest difference between Republican and Democrat ratings (left) and the top 4 books with the largest difference between Democrat and Republican ratings (right), with a 95% confidence interval from bootstrapping.
  • Figure 4: Steering Method and Dataset: We plot the average total variation for each dataset and steering method, averaged across demographic groups for the 30-token sequential distribution output. We find it is harder to steer models toward the dataset where opinions are hidden under a layer of abstraction (NYT). Additionally, few-shot steering improves distributional alignment for humans and all models except for GPT-3.5.
  • Figure 5: Models assume Democrats read more than Republicans. In this plot, we show the marginal distribution of Likert ratings (1-4) in response to the following question: "How likely are you to read this book?" A Likert rating of 1 refers to "Very unlikely" and a Likert rating of 4 refers to "Very likely". We averaged over 235 questions from NYT Book Opinions and 5 models steered towards Democrats and Republicans with persona steering (orange) and few-shot steering (green). In blue, we plot the human reference distribution for Democrat and Republican annotators. We find that persona steering produces more stereotypical results.
  • ...and 8 more figures
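
The captions for Figures 2 and 4 refer to two ideas that a small example may make concrete: expressing a distribution as a sequence of emitted answer tokens, and scoring alignment with total variation. The sketch below is illustrative only and is not the authors' evaluation code. It assumes the standard definition of total variation distance (half the L1 distance between two discrete distributions); the function names and the example sequence are hypothetical.

```python
from collections import Counter


def sequence_to_distribution(tokens, options):
    """Empirical distribution over `options` from a sequence of emitted answer tokens.

    Tokens that are not valid options are ignored (an assumption of this sketch).
    """
    counts = Counter(t for t in tokens if t in options)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no valid answer tokens in sequence")
    return {o: counts[o] / total for o in options}


def total_variation(p, q):
    """Total variation distance: half the L1 distance between two distributions."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


# Hypothetical 30-token sequence of H/T, as in the biased coin flip setup.
model_tokens = list("HHTHTHHHTH" * 3)  # 21 heads, 9 tails
model_dist = sequence_to_distribution(model_tokens, ["H", "T"])

# Hypothetical reference distribution to compare against.
reference = {"H": 0.5, "T": 0.5}
tv = total_variation(model_dist, reference)
print(model_dist, round(tv, 3))
```

A total variation of 0 means the two distributions are identical and 1 means they place all mass on disjoint outcomes, so lower values indicate better alignment under this metric.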