LLM Generated Persona is a Promise with a Catch

Ang Li, Haozhe Chen, Hongseok Namkoong, Tianyi Peng

TL;DR

The paper investigates the viability of using LLM-generated personas to simulate real populations at scale, highlighting substantial biases introduced during persona generation and their impact on downstream analyses such as election forecasts and opinion surveys. It introduces a structured taxonomy of persona types (Meta, Tabular, Descriptive) and conducts large-scale experiments across political and general domains to quantify misalignment with real-world data. The findings show that increasing LLM-generated content amplifies biases and shifts in simulated opinions, underscoring the need for a rigorous science of persona generation, calibration against real joint distributions, and open benchmarks. The work argues for interdisciplinary collaboration and provides open-source resources to accelerate the development of reliable, scalable, and human-centric silicon-sample simulations.

Abstract

The use of large language models (LLMs) to simulate human behavior has gained significant attention, particularly through personas that approximate individual characteristics. Persona-based simulations hold promise for transforming disciplines that rely on population-level feedback, including social science, economic analysis, marketing research, and business operations. Traditional methods to collect realistic persona data face significant challenges. They are prohibitively expensive and logistically challenging due to privacy constraints, and often fail to capture multi-dimensional attributes, particularly subjective qualities. Consequently, synthetic persona generation with LLMs offers a scalable, cost-effective alternative. However, current approaches rely on ad hoc and heuristic generation techniques that do not guarantee methodological rigor or simulation precision, resulting in systematic biases in downstream tasks. Through extensive large-scale experiments including presidential election forecasts and general opinion surveys of the U.S. population, we reveal that these biases can lead to significant deviations from real-world outcomes. Our findings underscore the need to develop a rigorous science of persona generation and outline the methodological innovations, organizational and institutional support, and empirical foundations required to enhance the reliability and scalability of LLM-driven persona simulations. To support further research and development in this area, we have open-sourced approximately one million generated personas, available for public access and analysis at https://huggingface.co/datasets/Tianyi-Lab/Personas.

Paper Structure

This paper contains 33 sections, 10 figures.

Figures (10)

  • Figure 1: Left: An example of an LLM-generated persona. Right: Applications of personas in the real world.
  • Figure 2: (a) Persona-driven simulation enables simulating human behaviors with LLMs. (b) Using LLMs to generate personas promises scalable simulation of diverse populations' behaviors. (c) We caution that improper usage of LLMs as persona generators may lead to homogeneous results.
  • Figure 3: We categorize existing persona generation approaches into four tiers. Each tier adds more LLM-generated information to the personas than the previous tier.
  • Figure 4: Persona-based simulations of the 2016, 2020, and 2024 elections.
  • Figure 5: Alignment scores for cross-model simulation. Each column represents a simulation model, while the x-axis within each column corresponds to the persona generation model. The "meta" point is singular as it relies on sampling rather than generation.
  • ...and 5 more figures
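
The Figure 5 caption notes that the "meta" tier relies on sampling rather than generation, and the TL;DR calls for calibration against real joint distributions. As a rough illustration of why the joint distribution matters, the sketch below draws personas from a joint table and, for comparison, from its marginals treated as independent. Everything here is an assumption made for illustration: the attribute names, the probabilities, and the helper functions are hypothetical and are not taken from the paper or from any census data.

```python
import random
from collections import Counter

# Hypothetical joint distribution over (age_group, education).
# The numbers are invented for illustration only.
JOINT = {
    ("18-34", "college"): 0.20,
    ("18-34", "no_college"): 0.10,
    ("35+", "college"): 0.20,
    ("35+", "no_college"): 0.50,
}


def sample_joint(n, rng):
    """Draw n personas directly from the joint table."""
    cells = list(JOINT)
    weights = [JOINT[c] for c in cells]
    return rng.choices(cells, weights=weights, k=n)


def marginals(joint):
    """Collapse the joint table into one marginal per attribute."""
    age, edu = Counter(), Counter()
    for (a, e), p in joint.items():
        age[a] += p
        edu[e] += p
    return dict(age), dict(edu)


def sample_independent(n, rng):
    """Draw each attribute from its marginal, ignoring correlations."""
    age, edu = marginals(JOINT)
    ages = rng.choices(list(age), weights=list(age.values()), k=n)
    edus = rng.choices(list(edu), weights=list(edu.values()), k=n)
    return list(zip(ages, edus))


def share(samples, cell):
    """Fraction of sampled personas that fall in the given cell."""
    return sum(1 for s in samples if s == cell) / len(samples)


rng = random.Random(0)  # fixed seed so the comparison is repeatable
joint_personas = sample_joint(20000, rng)
indep_personas = sample_independent(20000, rng)

cell = ("18-34", "college")
joint_share = share(joint_personas, cell)
indep_share = share(indep_personas, cell)
print(f"joint sampling:       {joint_share:.3f}")
print(f"independent sampling: {indep_share:.3f}")
```

In this toy table the young, college-educated cell has probability 0.20 under the joint distribution, but only 0.30 × 0.40 = 0.12 when the two attributes are sampled independently, so a population built from marginals alone would under-represent that group even before any LLM-generated content is added.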