How Many Human Survey Respondents is a Large Language Model Worth? An Uncertainty Quantification Perspective
Chengpiao Huang, Yuhang Wu, Kaizheng Wang
TL;DR
The paper asks how many synthetic responses from an LLM should be used to reliably infer population-level survey parameters when the LLM is imperfectly aligned with humans. It develops a data-driven framework that adaptively selects a simulation size k̂ to balance coverage against informativeness, and interprets k̂ as an effective human sample size κ that reflects the LLM’s fidelity through information-theoretic discrepancies. The authors prove average-case coverage guarantees for the constructed confidence sets, connect k̂ to the LLM’s misalignment through χ² and KL divergences, and illustrate heterogeneous fidelity across models and domains on the real datasets OpinionQA and EEDI. Numerical experiments show that the method achieves near-target coverage, yields sensible interval widths, and provides a practical fidelity metric κ̂ for comparing LLMs. The framework offers a principled, post hoc way to use LLM-simulated survey data for statistically valid inference, and it highlights the danger of naively relying on large synthetic samples when fidelity is low.
Abstract
Large language models (LLMs) are increasingly used to simulate survey responses, but synthetic data can be misaligned with the human population, leading to unreliable inference. We develop a general framework that converts LLM-simulated responses into reliable confidence sets for population parameters of human responses, addressing the distribution shift between the simulated and real populations. The key design choice is the number of simulated responses: too many produce overly narrow sets with poor coverage, while too few yield excessively loose estimates. We propose a data-driven approach that adaptively selects the simulation sample size to achieve nominal average-case coverage, regardless of the LLM's simulation fidelity or the confidence set construction procedure. The selected sample size is further shown to reflect the effective human population size that the LLM can represent, providing a quantitative measure of its simulation fidelity. Experiments on real survey datasets reveal heterogeneous fidelity gaps across different LLMs and domains.
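The trade-off described in the abstract can be illustrated with a small simulation. The sketch below is not the paper's procedure; it is a simplified stand-in under several assumptions that do not come from the source: binary survey questions, a Hoeffding confidence interval, a fixed bias of ±0.1 between the LLM's and the humans' response rates, and a simple scan that stops at the first k whose empirical miscoverage on calibration questions exceeds α. All function and variable names are hypothetical.

```python
import math
import random


def hoeffding_half_width(k, alpha):
    """Half-width of a two-sided Hoeffding interval for a mean of k values in [0, 1]."""
    return math.sqrt(math.log(2 / alpha) / (2 * k))


def prefix_means(responses):
    """Running means: entry k-1 is the mean of the first k responses."""
    out, total = [], 0
    for i, r in enumerate(responses, start=1):
        total += r
        out.append(total / i)
    return out


def miscoverage(k, sim_means, human_means, alpha):
    """Fraction of calibration questions whose human mean lies outside
    the interval built from the first k simulated responses."""
    half = hoeffding_half_width(k, alpha)
    misses = sum(abs(means[k - 1] - mu) > half
                 for means, mu in zip(sim_means, human_means))
    return misses / len(human_means)


def select_k(sim_means, human_means, alpha, k_max):
    """Scan k upward and return the last k before miscoverage first exceeds alpha.
    This stopping rule is an assumption made for illustration."""
    for k in range(1, k_max + 1):
        if miscoverage(k, sim_means, human_means, alpha) > alpha:
            return k - 1
    return k_max


# Synthetic calibration set (all values are made up for the illustration).
rng = random.Random(0)
M, K, ALPHA = 50, 2000, 0.1  # questions, simulated responses per question, target level
human_means = [rng.uniform(0.2, 0.8) for _ in range(M)]
# Assumed misalignment: the LLM's response rate is off by 0.1 in either direction.
llm_means = [mu + rng.choice([-0.1, 0.1]) for mu in human_means]
sim_means = [prefix_means([1 if rng.random() < p else 0 for _ in range(K)])
             for p in llm_means]

k_hat = select_k(sim_means, human_means, ALPHA, K)
print("selected k:", k_hat)
print("miscoverage at selected k:", miscoverage(k_hat, sim_means, human_means, ALPHA))
print("miscoverage using all K responses:", miscoverage(K, sim_means, human_means, ALPHA))
```

In this toy setting, using all K simulated responses shrinks the interval around the LLM's biased response rate, so it misses the human mean on most questions, whereas stopping at the selected k keeps the empirical miscoverage at or below α. A smaller assumed bias would allow a larger selected k, which matches the reading of the selected size as a measure of fidelity.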
