An Analysis of Large Language Models for Simulating User Responses in Surveys

Ziyun Yu, Yiru Zhou, Chen Zhao, Hongyi Wen

TL;DR

The paper investigates the extent to which large language models can simulate individual survey responses across diverse demographic profiles, using cross-domain questions from the World Values Survey. It compares Direct Prompting, Chain-of-Thought prompting, and a novel ClaimSim method that generates diverse demographic-specific claims to ground final predictions. Results show that ClaimSim increases response diversity and improves distribution alignment, but none of the approaches accurately emulates real users: LLMs exhibit limited reasoning over conflicting evidence and a tendency to maintain uniform viewpoints across demographics due to RLHF biases. The findings highlight fundamental challenges in user-simulation tasks and point to directions for improving diversity and reasoning in social NLP applications, including better prompt design and alignment strategies.

Abstract

Using Large Language Models (LLMs) to simulate user opinions has received growing attention. Yet LLMs, especially those trained with reinforcement learning from human feedback (RLHF), are known to exhibit biases toward dominant viewpoints, raising concerns about their ability to represent users from diverse demographic and cultural backgrounds. In this work, we examine the extent to which LLMs can simulate human responses to cross-domain survey questions through direct prompting and chain-of-thought prompting. We further propose a claim diversification method, CLAIMSIM, which elicits viewpoints from LLM parametric knowledge as contextual input. Experiments on the survey question answering task indicate that, while CLAIMSIM produces more diverse responses, both approaches struggle to accurately simulate users. Further analysis reveals two key limitations: (1) LLMs tend to maintain fixed viewpoints across varying demographic features and generate single-perspective claims; and (2) when presented with conflicting claims, LLMs struggle to reason over nuanced differences among demographic features, limiting their ability to adapt responses to specific user profiles.
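
To make "more diverse responses" and "distribution alignment" concrete, the following is a minimal sketch of how simulated answers could be compared with real survey answers. It is an illustration under stated assumptions, not the paper's evaluation code: the function names, the answer options, and the choice of normalized entropy (for diversity) and total variation distance (for alignment) are all hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def answer_distribution(answers, options):
    """Return the share of responses falling on each answer option."""
    if not answers:
        raise ValueError("answers must not be empty")
    counts = Counter(answers)
    return [counts.get(option, 0) / len(answers) for option in options]


def normalized_entropy(dist):
    """Diversity score in [0, 1]: 0 = all mass on one option, 1 = uniform.

    Assumption: entropy is used here only as an illustrative diversity measure.
    """
    if len(dist) < 2:
        return 0.0
    entropy = -sum(p * math.log(p) for p in dist if p > 0)
    return entropy / math.log(len(dist))


def total_variation(p, q):
    """Alignment gap in [0, 1] between two distributions (0 = identical)."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


# Hypothetical answer options and toy data, for illustration only.
OPTIONS = ["agree", "neutral", "disagree"]
simulated = ["agree", "agree", "agree", "agree"]       # collapsed onto one view
observed = ["agree", "agree", "neutral", "disagree"]   # more varied human answers

sim_dist = answer_distribution(simulated, OPTIONS)
obs_dist = answer_distribution(observed, OPTIONS)
diversity = normalized_entropy(sim_dist)
gap = total_variation(sim_dist, obs_dist)
```

In this toy case the simulated answers collapse onto a single option, so the diversity score is zero and the gap to the observed distribution is large. That is the kind of pattern the abstract describes when it says LLMs maintain fixed viewpoints across demographic features.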

Paper Structure

This paper contains 31 sections, 4 equations, 5 figures, 7 tables.

Figures (5)

  • Figure 1: Top: A survey question answering example where LLMs are instructed to simulate individual user responses over diverse demographic profiles. Middle: We study two LLM-based approaches on this task, CoT and ClaimSim. Bottom: ClaimSim produces more diverse answers, while both approaches struggle to simulate users accurately (slightly above random).
  • Figure 2: Comparison of answer distributions averaged across domains for Direct Prompting, CoT, and ClaimSim (left to right with GPT-4o-mini, Llama 4, and Qwen 3). ClaimSim leads to more diverse answer distributions.
  • Figure 3: A case study showing LLMs fail to reason over conflicting evidence about opinions.
  • Figure 4: A case study indicating LLMs produce unified viewpoints regardless of attributes.
  • Figure 5: Response distributions of three LLMs across the Gender, Politics, and Religion domains.