SocioBench: Modeling Human Behavior in Sociological Surveys with Large Language Models
Jia Wang, Ziyu Zhao, Tingjuntao Ni, Zhongyu Wei
TL;DR
SocioBench introduces a cross-cultural benchmark derived from ISSP survey data to evaluate how well large language models simulate real-world social attitudes. It uses demographic-conditioned role-playing prompts to elicit model responses that are compared against respondents' ground-truth answers across 10 sociological domains and more than 30 countries, enabling an accuracy-based evaluation. The study finds that state-of-the-art LLMs achieve about 30–40% accuracy on these tasks, with performance sensitive to model size, domain, and demographic subgroup; it also reveals biases in option distributions and only modest benefits from reasoning prompts. SocioBench provides a scalable, real-world testbed for assessing how well LLMs align with sociological attitudes, and it highlights substantial gaps in current models that limit their utility for survey-style social science research.
Abstract
Large language models (LLMs) show strong potential for simulating human social behaviors and interactions, yet the field lacks large-scale, systematically constructed benchmarks for evaluating their alignment with real-world social attitudes. To bridge this gap, we introduce SocioBench, a comprehensive benchmark derived from the annually collected, standardized survey data of the International Social Survey Programme (ISSP). The benchmark aggregates over 480,000 real respondent records from more than 30 countries, spanning 10 sociological domains and over 40 demographic attributes. Our experiments indicate that LLMs achieve only 30–40% accuracy when simulating individuals in complex survey scenarios, with statistically significant differences across domains and demographic subgroups. These findings highlight several limitations of current LLMs in survey scenarios, including insufficient individual-level data coverage, inadequate scenario diversity, and missing group-level modeling.
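The evaluation idea summarized above (condition a model on a respondent's demographics, ask a survey question, and score the answer against what that respondent actually chose) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the prompt wording, the record fields (`profile`, `predicted`, `actual`), and the helper names `build_prompt` and `accuracy_by_group` are all assumptions made for illustration, and the records in the usage note are made-up toy data rather than ISSP responses.

```python
from collections import defaultdict


def build_prompt(profile, question, options):
    """Build a role-playing prompt from a demographic profile (hypothetical wording)."""
    persona = "; ".join(f"{key}: {value}" for key, value in profile.items())
    choices = "\n".join(f"{i}. {text}" for i, text in enumerate(options, start=1))
    return (
        f"You are a survey respondent with this profile: {persona}.\n"
        f"Question: {question}\n{choices}\n"
        "Answer with the number of the option you would choose."
    )


def accuracy_by_group(records, group_key):
    """Fraction of exact matches between predicted and actual options, per subgroup.

    Each record is assumed to be a dict with 'profile', 'predicted', and 'actual'.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for record in records:
        group = record["profile"][group_key]
        totals[group] += 1
        hits[group] += int(record["predicted"] == record["actual"])
    return {group: hits[group] / totals[group] for group in totals}
```

As a usage example, calling `accuracy_by_group(records, "country")` on a list of scored records returns one accuracy value per country, which is the kind of subgroup breakdown the abstract refers to. Exact-match accuracy is the simplest reading of "accuracy-based evaluation"; the paper may apply additional rules (for instance, for skipped or invalid answers) that this sketch does not model.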
