SocioBench: Modeling Human Behavior in Sociological Surveys with Large Language Models

Jia Wang, Ziyu Zhao, Tingjuntao Ni, Zhongyu Wei

TL;DR

SocioBench is a cross-cultural benchmark derived from ISSP survey data that evaluates how well large language models simulate real-world social attitudes. It uses demographic-conditioned role-playing prompts to elicit model responses, which are compared against respondents' ground-truth answers across 10 sociological domains and 30+ countries, enabling accuracy-based evaluation. The study finds that state-of-the-art LLMs achieve about 30–40% accuracy on these tasks, with performance sensitive to model size, domain, and demographic subgroup; it also reveals biases in the models' option distributions and only modest benefits from reasoning prompts. The benchmark offers a scalable, real-world test of how well LLMs align with sociological attitudes and highlights substantial gaps in current models that limit their utility for survey-style social science research.

Abstract

Large language models (LLMs) show strong potential for simulating human social behaviors and interactions, yet the field lacks large-scale, systematically constructed benchmarks for evaluating their alignment with real-world social attitudes. To bridge this gap, we introduce SocioBench, a comprehensive benchmark derived from the annually collected, standardized survey data of the International Social Survey Programme (ISSP). The benchmark aggregates over 480,000 real respondent records from more than 30 countries, spanning 10 sociological domains and over 40 demographic attributes. Our experiments indicate that LLMs achieve only 30–40% accuracy when simulating individuals in complex survey scenarios, with statistically significant differences across domains and demographic subgroups. These findings highlight several limitations of current LLMs in survey scenarios, including insufficient individual-level data coverage, inadequate scenario diversity, and missing group-level modeling.

Paper Structure

This paper contains 34 sections, 1 equation, 14 figures, 15 tables.

Figures (14)

  • Figure 1: Overview of SocioBench. We first constructed a questionnaire question-answering dataset covering the ten sociological domains of the ISSP, along with a dataset containing ground-truth demographic labels and respondent answers. We then instructed the LLM to answer the survey conditioned on the demographic labels, and evaluated model performance by computing the accuracy of the LLM's responses against the ground-truth answers.
  • Figure 2: Experimental Results and Significance Analysis of Representative LLMs in the Cross-Continental Subgroup.
  • Figure 3: Question and Answer Option Distribution Analysis across ISSP Survey Domains. (a) shows the distribution of answer options per question across domains using violin plots. The width of each violin represents the density of questions with that number of options. The red line indicates the mean number of options, while the dark red line shows the median number of options for each domain. The black lines represent the data range (minimum to maximum values). (b) displays the overall distribution of questions grouped by answer option count across the entire dataset, showing how many questions have 2, 3, 4, 5, etc. answer options in total.
  • Figure 4: SocioBench Dataset: Questions and answers in social survey questionnaires
  • Figure 5: SocioBench Dataset: Respondent demographic information and ground-truth answers
  • ...and 9 more figures
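
The evaluation loop described in the Figure 1 caption (prompt the model with a respondent's demographic labels, then score its chosen option against that respondent's real answer) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the prompt wording, the function names `build_prompt` and `accuracy`, and the sample respondent and question are all assumptions made for this example.

```python
from typing import Mapping, Sequence


def build_prompt(
    demographics: Mapping[str, str],
    question: str,
    options: Sequence[str],
) -> str:
    """Compose a role-playing prompt conditioned on demographic labels.

    The wording is hypothetical; the paper's actual template may differ.
    """
    profile = "; ".join(f"{key}: {value}" for key, value in demographics.items())
    numbered = "\n".join(
        f"{index}. {text}" for index, text in enumerate(options, start=1)
    )
    return (
        f"You are a survey respondent with this profile: {profile}.\n"
        f"Question: {question}\n"
        f"Options:\n{numbered}\n"
        "Answer with the number of the option you would choose."
    )


def accuracy(predicted: Sequence[int], ground_truth: Sequence[int]) -> float:
    """Fraction of items where the model's option equals the respondent's answer."""
    if len(predicted) != len(ground_truth):
        raise ValueError("predicted and ground_truth must have the same length")
    if not ground_truth:
        return 0.0
    matches = sum(p == g for p, g in zip(predicted, ground_truth))
    return matches / len(ground_truth)


# Hypothetical respondent and question, for illustration only.
example_prompt = build_prompt(
    {"age": "34", "country": "Japan"},
    "How satisfied are you with your job?",
    ["Very satisfied", "Fairly satisfied", "Not satisfied"],
)

# Hypothetical model choices versus recorded answers for four questions.
example_score = accuracy([1, 2, 3, 1], [1, 2, 1, 1])
```

Scoring by exact option match mirrors the accuracy-based evaluation the summary describes; a real pipeline would also need to parse the model's free-text reply into an option number, which is omitted here.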