The Prompt Makes the Person(a): A Systematic Evaluation of Sociodemographic Persona Prompting for Large Language Models

Marlene Lutz, Indira Sen, Georg Ahnert, Elisa Rogers, Markus Strohmaier

TL;DR

The paper examines how sociodemographic persona prompts shape LLM outputs and how they risk stereotyping marginalized groups. It develops a systematic framework with two axes (role-adoption formats and demographic priming strategies) and evaluates open- and closed-ended tasks across 15 intersectional demographic groups using five instruction-tuned LLMs. Key findings show that interview-style prompting and name-based priming reduce stereotyping and improve alignment, while smaller models such as OLMo-2-7B outperform larger ones such as Llama-3.3-70B. The work provides actionable guidance for designing sociodemographic persona prompts, highlights ethical considerations of using names as demographic proxies, and releases code and data to support replication and further study.

Abstract

Persona prompting is increasingly used in large language models (LLMs) to simulate views of various sociodemographic groups. However, how a persona prompt is formulated can significantly affect outcomes, raising concerns about the fidelity of such simulations. Using five open-source LLMs, we systematically examine how different persona prompt strategies, specifically role adoption formats and demographic priming strategies, influence LLM simulations across 15 intersectional demographic groups in both open- and closed-ended tasks. Our findings show that LLMs struggle to simulate marginalized groups but that the choice of demographic priming and role adoption strategy significantly impacts their portrayal. Specifically, we find that prompting in an interview-style format and name-based priming can help reduce stereotyping and improve alignment. Surprisingly, smaller models like OLMo-2-7B outperform larger ones such as Llama-3.3-70B. Our findings offer actionable guidance for designing sociodemographic persona prompts in LLM-based simulation studies.

Paper Structure

This paper contains 52 sections, 15 figures, 13 tables.

Figures (15)

  • Figure 1: Evaluation Framework for Sociodemographic Persona Prompting. We construct sociodemographic persona prompts using combinations of three different role adoption formats and three strategies for demographic priming. We populate these prompts in conjunction with various sociodemographic groups and systematically evaluate them across both open- and closed-ended tasks using a broad set of bias and alignment measures.
  • Figure 2: Discrepancies in demographic group representation. We find systematic differences in self-descriptions of simulated demographic personas. We show the (a) number of marked words and (b) semantic diversity of generated self-descriptions for each demographic group. Values are aggregated across all prompt types and we apply min-max normalization for each model separately to indicate the relative ranking of groups. We observe that self-descriptions for nonbinary (N) personas generally exhibit the least favorable outcome (i.e., high marked word count and low semantic diversity), while simulations of male (M) personas lead to the most favorable results (i.e., low marked word count and high semantic diversity). Additionally, simulations of Middle-Eastern (ME) and Hispanic personas are generally associated with less favorable outcomes.
  • Figure 3: Opinion distance on OpinionsQA ($\downarrow$). Abbreviations: M = male, F = female. We report the average Wasserstein distance for the best-performing model, OLMo-2-7B. Differences across prompt types are generally modest, but the interview format leads to improved opinion distance (i.e., lower Wasserstein distance). We show the remaining models in Fig. \ref{fig:QA_dist} in Appendix \ref{app:OpinionQA}.
  • Figure 4: Comparison of prompt types and models. We present the (a) number of marked words and (b) semantic diversity of simulated self-descriptions for each prompt type and model. Values are aggregated across all demographic groups. We find that prompting with names and using the interview format leads to a lower (i.e., better) marked word count for all models. We observe a similar pattern for semantic diversity, with the exception of Gemma-3-27b and Llama-3.3-70B, which generally exhibit the worst performance across both measures (i.e., high marked word count and low semantic diversity).
  • Figure 5: Percentage of non-English self-descriptions. We report the percentage of non-English responses generated for Hispanic personas, who receive the highest proportion of such responses. Explicit demographic priming leads to higher rates of non-English responses.
  • ...and 10 more figures
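
The captions above rely on two simple computations: a Wasserstein distance between answer distributions (Figure 3) and per-model min-max normalization (Figure 2). The following is a minimal sketch of both, not the authors' released code. It assumes answer options are ordered and equally spaced, and the function names `wasserstein_ordinal` and `min_max_normalize` as well as the example numbers are hypothetical, chosen only for illustration.

```python
from itertools import accumulate


def wasserstein_ordinal(p, q):
    """1-Wasserstein distance between two distributions over the same
    ordered answer options, assuming unit spacing between options.

    For one-dimensional distributions this equals the sum of absolute
    differences between the two cumulative distribution functions.
    """
    if len(p) != len(q):
        raise ValueError("distributions must cover the same answer options")
    total_p, total_q = sum(p), sum(q)
    if total_p <= 0 or total_q <= 0:
        raise ValueError("distributions must have positive total mass")
    # Normalize so that raw answer counts are accepted as well.
    cdf_p = list(accumulate(x / total_p for x in p))
    cdf_q = list(accumulate(x / total_q for x in q))
    return sum(abs(a - b) for a, b in zip(cdf_p, cdf_q))


def min_max_normalize(values):
    """Rescale values to [0, 1]; a constant input maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


if __name__ == "__main__":
    # Made-up answer distributions over four ordered options.
    survey_answers = [0.1, 0.2, 0.3, 0.4]
    simulated_answers = [0.4, 0.3, 0.2, 0.1]
    print(wasserstein_ordinal(survey_answers, simulated_answers))

    # Made-up per-group scores for a single model.
    print(min_max_normalize([2, 4, 6]))
```

A lower distance means the simulated answer distribution sits closer to the reference distribution. Normalizing each model's scores separately, as the Figure 2 caption describes, preserves the relative ranking of groups within a model but makes absolute values incomparable across models.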