How Many Human Survey Respondents is a Large Language Model Worth? An Uncertainty Quantification Perspective

Chengpiao Huang, Yuhang Wu, Kaizheng Wang

TL;DR

The paper asks how many synthetic responses from an LLM should be used to infer population-level survey parameters reliably when the LLM is imperfectly aligned with humans. It develops a data-driven framework that adaptively selects a simulation sample size $k$ to balance coverage against informativeness, and interprets the selected $\widehat{k}$ as an effective human sample size $\kappa$ that reflects the LLM's fidelity through information-theoretic discrepancies. The authors prove average-case coverage guarantees for the constructed confidence sets, connect $\kappa$ to the LLM's misalignment through $\chi^2$ and KL divergences, and illustrate heterogeneous fidelity across models and domains on the real datasets OpinionQA and EEDI. Numerical experiments show that the method achieves near-target coverage, yields sensible interval widths, and provides a practical fidelity metric $\widehat{\kappa}$ for comparing LLMs. The framework offers a principled, post hoc way to use LLM-simulated survey data for statistically valid inference, and it highlights the danger of naively relying on large synthetic samples when fidelity is low.

Abstract

Large language models (LLMs) are increasingly used to simulate survey responses, but synthetic data can be misaligned with the human population, leading to unreliable inference. We develop a general framework that converts LLM-simulated responses into reliable confidence sets for population parameters of human responses, addressing the distribution shift between the simulated and real populations. The key design choice is the number of simulated responses: too many produce overly narrow sets with poor coverage, while too few yield excessively loose estimates. We propose a data-driven approach that adaptively selects the simulation sample size to achieve nominal average-case coverage, regardless of the LLM's simulation fidelity or the confidence set construction procedure. The selected sample size is further shown to reflect the effective human population size that the LLM can represent, providing a quantitative measure of its simulation fidelity. Experiments on real survey datasets reveal heterogeneous fidelity gaps across different LLMs and domains.
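As a rough illustration of the data-driven selection described above, the following Python sketch picks the largest simulation sample size whose synthetic intervals still cover the human sample means on a set of calibration questions. It is not the authors' algorithm: the interval construction (a conservative normal approximation for 0/1 responses), the default threshold `alpha / 2`, and the names `interval`, `miscoverage`, and `select_k` are assumptions made for this example.

```python
import math
from statistics import NormalDist


def interval(samples, alpha=0.05):
    """Conservative normal-approximation interval for the mean of 0/1 responses."""
    k = len(samples)
    mean = sum(samples) / k
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half = z * 0.5 / math.sqrt(k)  # 0.5 bounds the std. dev. of a 0/1 variable
    return mean - half, mean + half


def miscoverage(human_means, synthetic, k, alpha=0.05):
    """Fraction of questions whose k-sample synthetic interval misses the human mean."""
    misses = 0
    for y_bar, syn in zip(human_means, synthetic):
        lo, hi = interval(syn[:k], alpha)
        if not (lo <= y_bar <= hi):
            misses += 1
    return misses / len(human_means)


def select_k(human_means, synthetic, alpha=0.05, threshold=None):
    """Largest k such that miscoverage stays within the threshold for every size up to k.

    The default threshold alpha / 2 is an assumption of this sketch.
    """
    if threshold is None:
        threshold = alpha / 2
    max_k = min(len(s) for s in synthetic)
    k_hat = 0
    for k in range(1, max_k + 1):
        if miscoverage(human_means, synthetic, k, alpha) > threshold:
            break
        k_hat = k
    return k_hat


# Toy data: synthetic responses average 0.6 on every question.
pattern = [1, 0, 1, 0, 1] * 20
aligned = select_k([0.6, 0.6], [pattern, pattern])     # human means match the LLM
misaligned = select_k([0.4, 0.4], [pattern, pattern])  # human means differ from the LLM
print(aligned, misaligned)
```

In this toy setup the well-aligned case can use every available synthetic response, while the misaligned case stops early, which mirrors the idea that the selected sample size reflects simulation fidelity.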

Paper Structure

This paper contains 82 sections, 23 theorems, 224 equations, 12 figures, 6 tables, 1 algorithm.

Key Result

Theorem 2.1

Let Assumptions ec:assumption-iid-test-1D and ec:assumption-indep-data-1D hold. Assume that $\mathbb{P}( \bar{y}_j \le \mu_{j} \mid \psi_j ) \in [ \frac{1}{2} - \eta, \frac{1}{2} + \eta ]$ for each $j\in[m]$, where $\eta\in[0,1/2)$. Fix $\alpha\in(0,1)$. Then the simulation sample size $\widehat{k}$ ... The probability is taken with respect to the randomness of $\{ ( \psi_j , \mathcal{D}_j, \mathcal{D}\ldots$

Figures (12)

  • Figure 1: An Interpretation of an LLM as Being Made Up of $\widehat{k}$ Real Human Agents. Generating an output from the LLM can be thought of as sampling a response from a human agent inside the LLM. The figure was generated by ChatGPT 5 [GPT5] and borrows ideas from the Mechanical Turk, an 18th-century chess-playing machine with a human player hidden inside.
  • Figure 2: The Coverage-Width Trade-off for the Simulation Sample Size $k$. The true mean is $\mu=0.4$ (red dashed line), and the synthetic distribution has a mean of $\mu^{\mathsf{syn}}=0.6$. The horizontal axis is the simulation sample size $k$. The blue curve plots the sample mean $\bar{y}^{\mathsf{syn}}_k$ of the synthetic data, and the blue shaded region visualizes the confidence interval $\mathcal{I}^{\mathsf{syn}}(k)$, for $k\in[40]$. For a small sample size ($k\le 6$), the interval is too wide. For a large sample size (say $k\ge 18$), the interval becomes too narrow and fails to cover $\mu$.
  • Figure 3: The LLM as a Mechanical Turk. The top panel illustrates the traditional survey process, where an individual is sampled from the real human population to provide a response. The bottom panel depicts our conceptual model, in which the LLM is modeled as containing a hidden pool of agents drawn from that same population. When the LLM is queried, its response is generated by one of these internal agents. This figure is based on images generated by ChatGPT 5 [GPT5].
  • Figure 4: Parametric Bootstrap.
  • Figure 5: Our Framework.
  • ...and 7 more figures
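
The coverage-width trade-off described in the Figure 2 caption can be sketched with a small simulation. This is only an illustration under stated assumptions: responses are modeled as 0/1 draws, the interval is a conservative normal approximation rather than the paper's construction, and the names `synthetic_interval` and `tradeoff` are hypothetical.

```python
import math
import random


def synthetic_interval(samples, z=1.96):
    """Conservative normal-approximation interval for the mean of 0/1 responses."""
    k = len(samples)
    mean = sum(samples) / k
    half_width = z * 0.5 / math.sqrt(k)  # 0.5 bounds the std. dev. of a 0/1 variable
    return mean - half_width, mean + half_width


def tradeoff(mu=0.4, mu_syn=0.6, max_k=40, seed=0):
    """For k = 1..max_k, return (k, interval width, whether the interval covers mu)."""
    rng = random.Random(seed)
    draws = [1 if rng.random() < mu_syn else 0 for _ in range(max_k)]
    rows = []
    for k in range(1, max_k + 1):
        lo, hi = synthetic_interval(draws[:k])
        rows.append((k, hi - lo, lo <= mu <= hi))
    return rows


for k, width, covers in tradeoff():
    print(f"k={k:2d}  width={width:.3f}  covers_mu={covers}")
```

The width shrinks like $1/\sqrt{k}$ regardless of the data, so once the synthetic sample mean settles near $\mu^{\mathsf{syn}}=0.6$, a large enough $k$ produces an interval that no longer contains $\mu=0.4$.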

Theorems & Definitions (50)

  • Theorem 2.1: Coverage guarantee
  • Proof of Theorem 2.1
  • Example 3.1: Public opinion survey
  • Example 3.2: Sentiment in opinion survey
  • Example 3.3: Market research
  • Theorem 3.1: Coverage guarantee
  • Proof of Theorem 3.1
  • Definition 4.1: Quantile
  • Theorem 4.1: Information-theoretic characterization of $\kappa$
  • Proof of Theorem 4.1
  • ...and 40 more