Population-Aligned Persona Generation for LLM-based Social Simulation
Zhengyu Hu, Jianxun Lian, Zheyuan Xiao, Max Xiong, Yuxuan Lei, Tianfu Wang, Kaize Ding, Ziang Xiao, Nicholas Jing Yuan, Xing Xie
TL;DR
This paper tackles biases in LLM-based social simulation that arise from unrepresentative persona sets. It introduces a three-stage framework: (1) seed persona mining from large-scale narrative data; (2) global distribution alignment via a two-stage process that combines KDE-based importance sampling with entropic optimal transport; and (3) group-specific persona construction through embedding-driven retrieval and targeted LLM revisions. The approach aligns synthesized personas with real-world psychometric distributions (e.g., IPIP Big Five), comes with finite-sample guarantees, and improves both population-level fidelity and individual-level trait consistency across multiple datasets and models. Experiments across six psychometric instruments and regional settings show that the proposed Resample method consistently outperforms existing persona baselines and public persona sets, and that group-specific adaptation further improves applicability to targeted subpopulations. The work advances reliable, policy-relevant social simulation by ensuring that emergent dynamics reflect realistic population distributions rather than an 'average persona' fallacy.
Abstract
Recent advances in large language models (LLMs) have enabled human-like social simulations at unprecedented scale and fidelity, offering new opportunities for computational social science. A key challenge, however, is the construction of persona sets that authentically represent the diversity and distribution of real-world populations. Most existing LLM-based social simulation studies focus primarily on designing agentic frameworks and simulation environments, often overlooking the complexities of persona generation and the potential biases introduced by unrepresentative persona sets. In this paper, we propose a systematic framework for synthesizing high-quality, population-aligned persona sets for LLM-driven social simulation. Our approach begins by leveraging LLMs to generate narrative personas from long-term social media data, followed by rigorous quality assessment to filter out low-fidelity profiles. We then apply importance sampling to achieve global alignment with reference psychometric distributions, such as the Big Five personality traits. To address the needs of specific simulation contexts, we further introduce a task-specific module that adapts the globally aligned persona set to targeted subpopulations. Extensive experiments demonstrate that our method significantly reduces population-level bias and enables accurate, flexible social simulation for a wide range of research and policy applications.
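To make the global alignment step concrete, the following is a minimal sketch of importance resampling with kernel density estimates. It is an illustration under stated assumptions, not the authors' implementation: the function name `importance_resample`, its parameters, and the use of SciPy's `gaussian_kde` with default bandwidths are all hypothetical choices, and the sketch covers only the importance sampling idea, not the optimal transport stage or the group-specific module. Each persona in the pool is assumed to already have a numeric trait vector (for example, Big Five scores), and a sample of human reference scores is assumed to be available.

```python
import numpy as np
from scipy.stats import gaussian_kde


def importance_resample(pool_scores, ref_scores, n_out, seed=0):
    """Resample persona indices so pool traits approximate the reference.

    pool_scores: (n_pool, n_traits) trait scores of candidate personas.
    ref_scores:  (n_ref, n_traits) trait scores from a human reference sample.
    Returns the sampled indices and the normalized importance weights.
    """
    rng = np.random.default_rng(seed)

    # gaussian_kde expects data with shape (n_dims, n_points).
    target_kde = gaussian_kde(ref_scores.T)     # estimate of the target density
    proposal_kde = gaussian_kde(pool_scores.T)  # estimate of the pool density

    # Importance weight = target density / pool density, computed in log
    # space and shifted by the maximum for numerical stability.
    log_w = target_kde.logpdf(pool_scores.T) - proposal_kde.logpdf(pool_scores.T)
    log_w -= log_w.max()
    weights = np.exp(log_w)
    weights /= weights.sum()

    # Draw personas with probability proportional to their weight.
    idx = rng.choice(len(pool_scores), size=n_out, replace=True, p=weights)
    return idx, weights


# Synthetic stand-in data (assumption): the pool is shifted away from the
# reference population on two traits.
demo_rng = np.random.default_rng(1)
pool = demo_rng.normal(0.8, 1.0, size=(1500, 2))
ref = demo_rng.normal(0.0, 1.0, size=(1500, 2))

idx, weights = importance_resample(pool, ref, n_out=1000, seed=2)
print("pool mean:     ", pool.mean(axis=0))
print("resampled mean:", pool[idx].mean(axis=0))
print("reference mean:", ref.mean(axis=0))
```

In this sketch, personas from regions of trait space that are over-represented in the pool receive low weights, so the resampled set moves toward the reference distribution. A quantity such as `1.0 / np.sum(weights ** 2)` (the effective sample size) can indicate how many distinct personas actually carry the resampled set; a small value suggests the pool covers the target population poorly.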
