ChatGPT vs Social Surveys: Probing Objective and Subjective Silicon Population

Muzhi Zhou, Lu Yu, Xiaomin Geng, Lan Luo

TL;DR

This paper asks whether GPT-generated responses can meaningfully reflect real-world population characteristics. It constructs sampling distributions from repeated GPT samples and benchmarks them against the 2020 U.S. Census and the World Values Survey. Using GPT-3.5-turbo (with checks on GPT-4) and a framework based on the Central Limit Theorem (CLT), the authors examine both objective sociodemographics and subjective attitudes toward income inequality and gender roles. They find partial alignment for some demographics (e.g., gender and mean age) but clear biases in the race, education, and income distributions. The attitudinal outputs are highly deterministic and normally distributed, which diverges from human responses. The study highlights a knowledge–performance gap: GPT can articulate distributions it has learned but cannot reliably sample in a way that matches them. This raises cautions about using LLM-generated data as proxies for real populations and suggests hybrid approaches that integrate traditional surveys with LLM data. Overall, the work offers a sampling-theoretic lens for assessing the feasibility and limitations of silicon-population inference in social science research.

Abstract

Recent discussions about Large Language Models (LLMs) indicate that they have the potential to simulate human responses in social surveys and generate reliable predictions, such as those found in political polls. However, the existing findings are highly inconsistent, leaving us uncertain about the population characteristics of data generated by LLMs. In this paper, we employ repeated random sampling to create sampling distributions that identify the population parameters of silicon samples generated by GPT. Our findings show that GPT's demographic distribution aligns with the 2020 U.S. population in terms of gender and average age. However, GPT significantly overestimates the representation of the Black population and individuals with higher levels of education, even when it possesses accurate knowledge. Furthermore, GPT's point estimates for attitudinal scores are highly inconsistent and show no clear inclination toward any particular ideology. The sample response distributions exhibit a normal pattern that diverges significantly from those of human respondents. Consistent with previous studies, we find that GPT's answers are more deterministic than those of humans. We conclude by discussing the concerning implications of this biased and deterministic silicon population for making inferences about real-world populations.
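
The repeated-sampling idea in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the function name `sampling_distribution`, the synthetic response pool, the 0.6 share, and the sample sizes are all assumptions made for illustration. The sketch draws many random samples, records each sample mean, and uses the spread of those means to estimate the underlying population parameter and its standard error.

```python
import random
import statistics


def sampling_distribution(population, sample_size, n_samples, seed=0):
    """Draw repeated random samples (with replacement) and return each sample's mean."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.choices(population, k=sample_size))
        for _ in range(n_samples)
    ]


# Hypothetical stand-in for a pool of binary responses (1 = female, 0 = male).
# The 0.6 share is an arbitrary illustrative value, not a figure from the paper.
population = [1] * 600 + [0] * 400

means = sampling_distribution(population, sample_size=100, n_samples=1000)

# The mean of the sample means estimates the population parameter;
# their standard deviation estimates the standard error.
estimate = statistics.fmean(means)
std_error = statistics.stdev(means)
print(f"estimate={estimate:.3f}, standard error={std_error:.3f}")
```

In a real replication the synthetic pool would presumably be replaced by responses collected from the model, and the resulting estimate compared against a benchmark such as a census proportion.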

Paper Structure

This paper contains 28 sections, 17 figures, 5 tables.

Figures (17)

  • Figure 1: Experiment flow of the two studies
  • Figure 2: Sampling distribution of gender, age, and region
  • Figure 3: The sampling distribution of racial groups
  • Figure 4: The sampling distribution of education groups
  • Figure 5: The sampling distribution of income groups
  • ...and 12 more figures