
LLM Agents as Social Scientists: A Human-AI Collaborative Platform for Social Science Automation

Lei Wang, Yuanzi Li, Jinchao Wu, Heyang Gao, Xiaohe Bo, Xu Chen, Ji-Rong Wen

Abstract

Traditional social science research often requires designing complex experiments across vast methodological spaces and depends on real human participants, making it labor-intensive, costly, and difficult to scale. Here we present S-Researcher, an LLM-agent-based platform that assists researchers in conducting social science research more efficiently and at greater scale by "siliconizing" both the research process and the participant pool. To build S-Researcher, we first develop YuLan-OneSim, a large-scale social simulation system designed around three core requirements: generality via auto-programming from natural language to executable scenarios, scalability via a distributed architecture supporting up to 100,000 concurrent agents, and reliability via feedback-driven LLM fine-tuning. Leveraging this system, S-Researcher supports researchers in designing social experiments, simulating human behavior with LLM agents, analyzing results, and generating reports, forming a complete human-AI collaborative research loop in which researchers retain oversight and intervention at every stage. We operationalize LLM simulation research paradigms into three canonical reasoning modes (induction, deduction, and abduction) and validate S-Researcher through systematic case studies: inductive reproduction of cultural dynamics consistent with Axelrod's theory, deductive testing of competing hypotheses on teacher attention validated against survey data, and abductive identification of a cooperation mechanism in public goods games confirmed by human experiments. S-Researcher establishes a new human-AI collaborative paradigm for social science, in which computational simulation augments human researchers to accelerate discovery across the full spectrum of social inquiry.


Paper Structure

This paper contains 12 sections, 2 equations, and 6 figures.

Figures (6)

  • Figure 1: Overview of S-Researcher. (a) The complete workflow of S-Researcher: users input their research topics, after which simulation scenarios are automatically constructed, executed, and summarized into comprehensive reports. Researchers can intervene at every stage. (b) Summary of three case studies organized by reasoning paradigm: induction, deduction, and abduction.
  • Figure 2: Platform capability validation of YuLan-OneSim. a, Human expert ratings for auto-generated code across eight social science domains. b, Error type distribution across domains. c, Ablation study on workflow scores: OneSim vs. w/o G-Valid. d, Code quality ablation: four variant comparison. e, Runtime scaling with agent count. f, Distributed vs. single-node deployment efficiency. g, Feedback-driven optimization trajectories for two backbone models (Qwen2.5-1.5B, Llama-3.2-1B) under SFT and DPO strategies.
  • Figure 3: Inductive paradigm: S-Researcher autonomously reproduces Axelrod's cultural dissemination dynamics, confirming coexistence of local convergence and global polarization. a, Research question input to S-Researcher. b, Simulation protocol following the ODD standard. c, Experimental setup: 100 LLM agents on a $10\times10$ grid, 5 cultural features $\times$ 5 values, 100 rounds, 3 replicates. d, Dual-metric time series: local convergence (blue, left axis) increases from $\sim$0.20 to $\sim$0.24 (+21.0%); cultural diversity (red, right axis) decreases from $\sim$1.0 to $\sim$0.65. Shaded areas show variability across replicates. e, Distribution of neighbor cultural similarity across rounds, with cumulative high-similarity proportion (sim $\geq$ 0.6) increasing from 12.0% to 50.0%. f, Cultural cluster composition over time: singleton agents decline while large clusters ($4+$) grow from 0% to 38%. g, Dominant value share evolution across five cultural dimensions, rising from the uniform baseline (20%) to a range of 32-37%. h, Pairwise cultural similarity matrices at four time points (Rounds 1, 10, 50, 100), showing progressive sharpening of cultural boundaries.
  • Figure 4: Deductive paradigm: bottom-up classroom simulation independently recovers the empirically established dominance of expressive ability in teacher attention allocation. a, Research question input to S-Researcher. b, Simulation protocol following the ODD standard. c, Experimental setup: 221 simulated classrooms, 5,525 student agents with profiles from CEPS, 30 rounds $\times$ 3 replicates. d, Spearman $\rho$ between simulated and empirical CEPS attention rankings ($\uparrow$): Expression ($0.152$) $>$ Merit ($0.122$) $>$ Elite ($0.113$). e, Root-mean-square error ($\downarrow$): Expression ($0.846$) yields the best fit. f, Convergence of $\rho$ over 30 simulation rounds; hypothesis ranking stabilizes after round 5. g, Independent validation via CEPS regression: communicative ability ($\beta = 0.349$, $R^2 = 12.1\%$) explains substantially more variance than academic achievement or SES, confirming the simulation-derived ordering. h, Attention dynamics comparison: transition matrices from simulation (early vs. late phase) and CEPS two-wave panel data (Wave 1 vs. Wave 2) both exhibit diagonal-dominant persistence; simulation reproduces the qualitative structure but overestimates rigidity (e.g., Low$\to$High: $1.8\%$ vs. $16.6\%$), consistent with its role as a mechanism-isolating experiment.
  • Figure 5: Abductive paradigm: counterfactual decomposition of follower cooperation in public goods games reveals behavioral anchoring as the dominant mechanism and uncovers an unexpected forced $>$ voluntary effect. a, Research question: what causal mechanisms drive follower cooperation when leaders contribute first? b, Simulation protocol following the ODD standard. c, Experimental setup: $2 \times 3$ between-subjects design (voluntary/forced $\times$ low/medium/high contribution), 100 agent followers per condition, 3 replicates; parallel human experiment ($N = 120$). d, LLM agent simulation box plots ($N_F = 100$, $n = 3$): follower contributions under voluntary (blue) and forced (orange) conditions across leader contribution levels; diamonds mark means. Forced conditions elicit higher contributions at every level. e, Human-agent alignment scatter plot across 6 conditions, Pearson $r = 0.915$. f, Human experiment box plots ($N = 120$): humans show the same forced $>$ voluntary pattern, with the gap widening at medium and high contribution levels. g, Effect size comparison forest plot: both agents and humans show significant effects of leader contribution ($\beta_{\text{agent}} = 0.794$, $\beta_{\text{human}} = 0.491$) and decision mechanism ($\beta_{\text{agent}} = 0.104$, $\beta_{\text{human}} = 0.251$), with leader contribution consistently the dominant factor.
  • ...and 1 more figure
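For readers unfamiliar with the baseline dynamics that Figure 3 reproduces, the sketch below implements the classic rule-based Axelrod cultural dissemination model on a $10\times10$ grid with 5 features $\times$ 5 values, matching the setup in panel c. This is a hypothetical illustration of the underlying theory only: the paper's S-Researcher experiments use LLM agents rather than this probabilistic copy rule, and the step count and metric here are illustrative choices, not the paper's protocol.

```python
import random

def run_axelrod(n=10, n_features=5, n_values=5, steps=50_000, seed=0):
    """Classic Axelrod cultural dissemination on an n x n torus.

    Each agent holds a culture vector of n_features traits. At each step a
    random agent and a random neighbor interact with probability equal to
    their cultural similarity (fraction of shared traits); on interaction,
    the agent copies one trait on which the pair currently differ.
    Returns the number of distinct cultures remaining (a diversity proxy).
    """
    rng = random.Random(seed)
    grid = [[[rng.randrange(n_values) for _ in range(n_features)]
             for _ in range(n)] for _ in range(n)]
    for _ in range(steps):
        x, y = rng.randrange(n), rng.randrange(n)
        dx, dy = rng.choice([(1, 0), (-1, 0), (0, 1), (0, -1)])
        a, b = grid[x][y], grid[(x + dx) % n][(y + dy) % n]
        shared = sum(f == g for f, g in zip(a, b))
        # Interact with probability = similarity; copying is only possible
        # when the pair share something but are not already identical.
        if 0 < shared < n_features and rng.random() < shared / n_features:
            i = rng.choice([k for k in range(n_features) if a[k] != b[k]])
            a[i] = b[i]
    return len({tuple(c) for row in grid for c in row})
```

Under these rules, local interactions drive neighboring cultures together while unbridgeable differences (zero similarity) persist, yielding the local convergence plus global polarization pattern that Figure 3 reports for the LLM-agent version.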