LLMs, Virtual Users, and Bias: Predicting Any Survey Question Without Human Data

Enzo Sinacola, Arnault Pachot, Thierry Petit

TL;DR

This paper investigates whether Large Language Models (LLMs) can generate virtual respondents to predict survey outcomes, and how they compare to traditional models such as Random Forests (RFs) trained on demographic data. Using World Values Survey Wave 7 data, the authors evaluate multiple LLMs and RFs through a four-phase methodology that includes prompt and temperature optimization, cross-model comparisons, demographic subgroup analyses, and a comparison of censored and uncensored models. The key finding is that LLMs are competitive with RFs and require no training data, but they exhibit biases across ethnic and religious groups; removing censorship improves predictive accuracy, especially for underrepresented populations. While RFs surpass LLMs when trained on sufficient data, the no-training-data advantage and scalability of LLMs offer practical benefits for rapid, large-scale public opinion polling, provided bias mitigation and censorship strategies are refined.

Abstract

Large Language Models (LLMs) offer a promising alternative to traditional survey methods, potentially enhancing efficiency and reducing costs. In this study, we use LLMs to create virtual populations that answer survey questions, enabling us to predict outcomes comparable to human responses. We evaluate several LLMs, including GPT-4o, GPT-3.5, Claude 3.5-Sonnet, and versions of the Llama and Mistral models, and compare their performance to that of a traditional Random Forests algorithm using demographic data from the World Values Survey (WVS). LLMs demonstrate competitive performance overall, with the significant advantage of requiring no additional training data. However, they exhibit biases when predicting responses for certain religious and population groups, underperforming in these areas. Random Forests, by contrast, outperform LLMs when trained with sufficient data. We observe that removing censorship mechanisms from LLMs significantly improves predictive accuracy, particularly for underrepresented demographic segments where censored models struggle. These findings highlight the importance of addressing biases and reconsidering censorship approaches in LLMs to enhance their reliability and fairness in public opinion research.

Paper Structure

This paper contains 27 sections, 6 figures, 1 table.

Figures (6)

  • Figure 1: Dataset division into input features and target variables
  • Figure 2: Average Accuracy by Models
  • Figure 3: Average Accuracy by Region
  • Figure 4: Average Accuracy by Religion
  • Figure 5: Comparison of Model Accuracy Across Population Groups for Dolphin-Llama3-8B and Llama3-8B
  • ...and 1 more figure