Valid Survey Simulations with Limited Human Data: The Roles of Prompting, Fine-Tuning, and Rectification

Stefan Krsteski, Giuseppe Russo, Serina Chang, Robert West, Kristina Gligorić

Abstract

Surveys provide valuable insights into public opinion and behavior, but their execution is costly and slow. Large language models (LLMs) have been proposed as a scalable, low-cost substitute for human respondents, but their outputs are often biased and yield invalid estimates. We study the interplay between synthesis methods that use LLMs to generate survey responses and rectification methods that debias population estimates, and explore how human responses are best allocated between them. Using two panel surveys with questions on nutrition, politics, and economics, we find that synthesis alone introduces substantial bias (24-86%), whereas combining it with rectification reduces bias below 5% and increases effective sample size by up to 14%. Overall, we challenge the common practice of using all human responses for fine-tuning, showing that under a fixed budget, allocating most to rectification results in far more effective estimation.

Paper Structure

This paper contains 37 sections, 14 equations, 4 figures, 9 tables.

Figures (4)

  • Figure 1: Evaluation setup: Overview of synthesis and rectification. Given a small human dataset $(X^n, Y^n)$ and a disjoint, large demographic-only dataset $X^N$, synthesis produces responses $\hat{Y}^n, \hat{Y}^N$ using either prompting or fine-tuning. Rectification then combines the model predictions $\hat{Y}^n$ with the human responses $Y^n$ to compute a correction term. This term is combined with $\hat{Y}^N$ to produce a final estimate $\hat{\theta}$ of the target $\theta^*$, with corresponding confidence intervals.
  • Figure 2: Fine-tuning and rectification interaction analysis. Results are averaged over 100 independent runs. (a) Bias vs. fine-tuning (FT) allocation with confidence bands; a 20% allocation minimizes bias (red star). (b) Efficiency peaks at 80% FT allocation, but with high uncertainty. (c) Pareto frontier with 95% confidence ellipses; points toward the upper left reflect better trade-offs. (d) Comparison across allocation policies.
  • Figure 3: Individual variation in energy intake within identical demographic groups. To the LLM, each point represents an individual with the same observable characteristics (age 66–70, female, 60–70 kg, Non-Hispanic White, no special diet), yet their actual energy intake ranges widely, from 500 to 2,500 kcal.
  • Figure 4: ESS gain under $\text{Rec}_{\lambda_{\text{opt}}}$ across labeled-sample sizes $n_{\text{human}}\in\{50,100,150,200\}$.
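
The rectification step described in the Figure 1 caption can be illustrated with a minimal sketch. It assumes a prediction-powered-inference-style estimator of a population mean; the captions do not state the exact estimator, so this is only one plausible reading. The function names (`rectified_mean`, `power_tuned_lambda`, `ess_gain`), the normal-approximation interval, and the definition of ESS gain as a variance ratio are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def rectified_mean(y_human, yhat_human, yhat_pool, lam=1.0):
    """Hypothetical PPI-style mean estimate with a normal-approximation 95% CI.

    y_human    : human responses Y^n
    yhat_human : model predictions for the same n respondents
    yhat_pool  : model predictions for the N demographic-only profiles
    lam        : weight on the model predictions (lam=0 ignores the model)
    """
    y_human = np.asarray(y_human, dtype=float)
    yhat_human = np.asarray(yhat_human, dtype=float)
    yhat_pool = np.asarray(yhat_pool, dtype=float)
    n, N = len(y_human), len(yhat_pool)

    # Correction term: average gap between human answers and scaled predictions.
    residual = y_human - lam * yhat_human
    theta_hat = lam * yhat_pool.mean() + residual.mean()

    # Variance of the estimate: prediction part plus correction part.
    var = lam**2 * yhat_pool.var(ddof=1) / N + residual.var(ddof=1) / n
    half_width = 1.96 * np.sqrt(var)
    return theta_hat, (theta_hat - half_width, theta_hat + half_width), var

def power_tuned_lambda(y_human, yhat_human, yhat_pool):
    """Variance-minimizing lam for the mean estimator above (an assumption
    about what lambda_opt in the Figure 4 caption refers to)."""
    y_human = np.asarray(y_human, dtype=float)
    yhat_human = np.asarray(yhat_human, dtype=float)
    yhat_pool = np.asarray(yhat_pool, dtype=float)
    n, N = len(y_human), len(yhat_pool)
    cov = np.cov(y_human, yhat_human, ddof=1)[0, 1]
    var_pred = np.concatenate([yhat_human, yhat_pool]).var(ddof=1)
    return cov / ((1.0 + n / N) * var_pred)

def ess_gain(y_human, var_rectified):
    """Relative ESS gain, assumed here to be the ratio of the human-only
    variance of the mean to the rectified variance, minus one."""
    y_human = np.asarray(y_human, dtype=float)
    var_classical = y_human.var(ddof=1) / len(y_human)
    return var_classical / var_rectified - 1.0
```

With `lam=0` the estimate reduces to the plain mean of the human responses, and with `lam=1` the model predictions on the large pool are shifted by the average human-minus-model gap. Intermediate values trade off the two sources of variance.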