Assessing the Reliability of Persona-Conditioned LLMs as Synthetic Survey Respondents

Erika Elizabeth Taday Morocho, Lorenzo Cima, Tiziano Fagni, Marco Avvenuti, Stefano Cresci

TL;DR

Persona prompting does not yield a clear aggregate improvement in survey alignment and, in many cases, significantly degrades performance, which points to a key adverse impact of current persona-based simulation practices.

Abstract

Using persona-conditioned LLMs as synthetic survey respondents has become a common practice in computational social science and agent-based simulations. Yet, it remains unclear whether multi-attribute persona prompting improves LLM reliability or instead introduces distortions. Here we contribute to this assessment by leveraging a large dataset of U.S. microdata from the World Values Survey. Concretely, we evaluate two open-weight chat models and a random-guesser baseline across more than 70K respondent-item instances. We find that persona prompting does not yield a clear aggregate improvement in survey alignment and, in many cases, significantly degrades performance. Persona effects are highly heterogeneous as most items exhibit minimal change, while a small subset of questions and underrepresented subgroups experience disproportionate distortions. Our findings highlight a key adverse impact of current persona-based simulation practices: demographic conditioning can redistribute error in ways that undermine subgroup fidelity and risk misleading downstream analyses.

Paper Structure

This paper contains 19 sections, 3 equations, 1 figure, and 5 tables.

Figures (1)

  • Figure 1: Detailed results for the Llama-2-13B (top row) and Qwen3-4B (bottom row) models. For each model, we report a comparison of hard similarity (HS, blue-colored, left panel) and soft similarity (SS, red-colored, central panel) scores obtained for each question using the vanilla (V, x-axis) and persona-based (PB, y-axis) versions of the models. Additionally, for each model, the right panel shows the item-wise differences between the PB and V model variants, in terms of HS (blue-colored, top) and SS (red-colored, bottom). Positive differences ($>0$) indicate that PB outperforms V for the corresponding item and metric.
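The item-wise comparison in the right panels can be illustrated with a minimal sketch. It assumes per-item scores for the V and PB variants are already available as mappings from item identifier to score. The function names, item identifiers, and score values below are hypothetical and are not taken from the paper, and the definitions of HS and SS themselves are not reproduced here.

```python
def item_wise_differences(vanilla, persona):
    """Return {item: PB score - V score} for items present in both mappings."""
    shared = vanilla.keys() & persona.keys()
    return {item: persona[item] - vanilla[item] for item in sorted(shared)}


def summarize(diffs):
    """Group items by the sign of their PB - V difference."""
    return {
        "improved": [i for i, d in diffs.items() if d > 0],   # PB outperforms V
        "degraded": [i for i, d in diffs.items() if d < 0],   # V outperforms PB
        "unchanged": [i for i, d in diffs.items() if d == 0],
    }


# Hypothetical per-item scores for one metric (illustrative values only).
vanilla_scores = {"Q1": 0.5, "Q2": 0.75, "Q3": 0.25}
persona_scores = {"Q1": 0.75, "Q2": 0.5, "Q3": 0.25}

diffs = item_wise_differences(vanilla_scores, persona_scores)
summary = summarize(diffs)
print(diffs)
print(summary)
```

The same helper applies to either metric, since the figure reports the HS and SS differences separately using the same PB minus V convention.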