Survey Response Generation: Generating Closed-Ended Survey Responses In-Silico with Large Language Models

Georg Ahnert, Anna-Carolina Haensch, Barbara Plank, Markus Strohmaier

TL;DR

This work addresses the lack of standardization in simulating closed-ended survey responses with LLMs by systematically comparing eight Survey Response Generation Methods across four political-attitude datasets and ten open-weight LLMs. It evaluates both individual-level and subpopulation-level alignment using macro F1-scores and distributional distances, revealing substantial differences across methods. The findings show that Restricted Generation Methods, especially Restricted Choice, yield the best overall alignment and are more computationally efficient than Open Generation approaches, while Token Probability-Based Methods perform poorly, and reasoning-based outputs do not reliably improve results. The study provides practical guidelines for selecting SRG methods in in-silico surveys and discusses limitations, generalizability, and ethical considerations for future work.

Abstract

Many in-silico simulations of human survey responses with large language models (LLMs) focus on generating closed-ended survey responses, whereas LLMs are typically trained to generate open-ended text instead. Previous research has used a diverse range of methods for generating closed-ended survey responses with LLMs, and a standard practice remains to be identified. In this paper, we systematically investigate the impact that various Survey Response Generation Methods have on predicted survey responses. We present the results of 32 million simulated survey responses across 8 Survey Response Generation Methods, 4 political attitude surveys, and 10 open-weight language models. We find significant differences between the Survey Response Generation Methods in both individual-level and subpopulation-level alignment. Our results show that Restricted Generation Methods perform best overall, and that reasoning output does not consistently improve alignment. Our work underlines the significant impact that Survey Response Generation Methods have on simulated survey responses, and we develop practical recommendations on the application of Survey Response Generation Methods.

Paper Structure

This paper contains 25 sections, 11 figures, 19 tables.

Figures (11)

  • Figure 1: Survey Response Generation Methods Elicit Closed-Ended Survey Responses From LLMs. We prompt all models with a combined Persona & Question Prompt to predict political attitudes in the U.S. or Germany. All implemented Survey Response Generation Methods elicit closed-ended survey responses from the LLMs we investigate. We evaluate the individual-level alignment of these responses against human survey data, and the distribution alignment in subpopulations against human response distributions.
  • Figure 2: Individual-Level Alignment Between In-Silico Generated and Human Survey Responses by Dataset (Columns) and Simulation Specification. Top: macro avg. F1-score $(\uparrow)$ for each aggregated simulation specification, mean across the respective runs. Bottom: simulation specification (Survey Response Generation Method, response option variant, model size, and decoding strategy), sorted by macro avg. F1-score $(\rightarrow)$. Invalid responses are counted as incorrect. Individual-level alignment varies strongly between Survey Response Generation Methods. For subpopulation-level alignment, see the Appendix figures labeled app_fig:all_datasets_tv_specification and app_fig:all_datasets_dcor_specification.
  • Figure 3: Mean GPU Time for a Single Survey Response. We run all models using vLLM on 2 NVIDIA H100 GPUs (tensor-parallel). We report GPU time instead of token count to account for optimizations such as automatic prefix caching, as well as for the overhead created by restricting the vocabulary of an LLM with structured outputs. On the log-scale y-axis, Open Generation Methods, larger models, and in particular reasoning models require orders of magnitude more GPU time than Token Probability-Based Methods or the Restricted Choice Method.
  • Figure 4: Subpopulation-Level Alignment: Total Variation Distance/1-Wasserstein Distance. For the ANES and GLES datasets, we use total variation distance to measure alignment on categorical response options. For the ATP dataset, we use 1-Wasserstein Distance to measure alignment on ordinal response options. Top: alignment metric (lower is better) for each aggregated simulation specification, mean across the respective runs. Bottom: simulation specification---Survey Response Generation Method, response option variant, model size, and decoding strategy---sorted by the respective alignment metric. Specifications that lead to more than 10% invalid responses are excluded.
  • Figure 5: Subpopulation-Level Alignment, Global Perspective: Distance Correlation. Top: Distance correlation (higher is better) for each aggregated simulation specification, mean across the respective runs. Bottom: simulation specification (Survey Response Generation Method, response option variant, model size, and decoding strategy), sorted by distance correlation. Specifications that lead to more than 10% invalid responses are excluded.
  • ...and 6 more figures
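
The captions for Figures 2 and 4 name three alignment metrics: macro-averaged F1 at the individual level, and total variation distance (categorical options) and 1-Wasserstein distance (ordinal options) at the subpopulation level. The following is a minimal sketch of how such metrics can be computed; it is not the authors' implementation. All function names are hypothetical, the toy data is made up for illustration, and the Wasserstein variant assumes ordinal options spaced one unit apart.

```python
from collections import Counter


def response_distribution(responses, options):
    """Relative frequency of each response option (invalid responses are ignored)."""
    valid = [r for r in responses if r in options]
    counts = Counter(valid)
    return [counts[o] / len(valid) for o in options]


def total_variation_distance(p, q):
    """Half the L1 distance between two categorical distributions (lower is better)."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def wasserstein_1_ordinal(p, q):
    """1-Wasserstein distance for ordinal options, assuming unit spacing:
    the sum of absolute differences between the two cumulative distributions."""
    total, cum_p, cum_q = 0.0, 0.0, 0.0
    for a, b in zip(p, q):
        cum_p += a
        cum_q += b
        total += abs(cum_p - cum_q)
    return total


def macro_f1(y_true, y_pred, options):
    """Unweighted mean of per-option F1 scores. A prediction outside `options`
    (e.g. None for an invalid response) never matches, so it counts as incorrect."""
    scores = []
    for o in options:
        tp = sum(t == o and p == o for t, p in zip(y_true, y_pred))
        fp = sum(t != o and p == o for t, p in zip(y_true, y_pred))
        fn = sum(t == o and p != o for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example (made-up data): four respondents, three response options.
options = ["A", "B", "C"]
human = ["A", "A", "B", "C"]
simulated = ["A", "B", "B", "B"]

p = response_distribution(human, options)
q = response_distribution(simulated, options)
print("TV distance:", total_variation_distance(p, q))
print("1-Wasserstein:", wasserstein_1_ordinal(p, q))
print("macro F1:", macro_f1(human, simulated, options))
```

Macro F1 compares paired responses respondent by respondent, whereas the two distances compare only the aggregated response distributions of a subpopulation, which is why a method can do well on one level and poorly on the other.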