AI-Augmented Surveys: Leveraging Large Language Models and Surveys for Opinion Prediction

Junsol Kim; Byungkyu Lee

TL;DR

The paper demonstrates that LLMs and surveys can mutually enhance each other's capabilities: LLMs can broaden survey potential, while surveys can improve the alignment of LLMs.

Abstract

Large language models (LLMs) that produce human-like responses have begun to revolutionize research practices in the social sciences. We develop a novel methodological framework that fine-tunes LLMs with repeated cross-sectional surveys to incorporate the meaning of survey questions, individual beliefs, and temporal contexts for opinion prediction. We introduce two emerging applications of the AI-augmented survey: retrodiction (i.e., predicting year-level missing responses) and unasked opinion prediction (i.e., predicting entirely missing responses). Across 3,110 binarized opinions from 68,846 Americans in the General Social Survey from 1972 to 2021, our models based on Alpaca-7b excel in retrodiction (AUC = 0.86 for personal opinion prediction, $\rho$ = 0.98 for public opinion prediction). These remarkable prediction capabilities allow us to fill in missing trends with high confidence and pinpoint when public attitudes changed, such as the rising support for same-sex marriage. On the other hand, our fine-tuned Alpaca-7b models show modest success in unasked opinion prediction (AUC = 0.73, $\rho$ = 0.67). We discuss practical constraints and ethical concerns regarding individual autonomy and privacy when using LLMs for opinion prediction. Our study demonstrates that LLMs and surveys can mutually enhance each other's capabilities: LLMs can broaden survey potential, while surveys can improve the alignment of LLMs.

Paper Structure (24 sections, 1 equation, 17 figures, 3 tables)

Introduction
Challenges in predicting survey responses
Promises and Challenges of Machine Learning in Opinion Prediction
Fine-tuning Large Language Models with Nationally Representative Surveys
Data and Method
Results
Discussion
Appendix for AI-Augmented Surveys: Leveraging Large Language Models and Surveys for Opinion Prediction
Appendix Tables and Figures

Figures (17)

Figure 1: Three types of missing problems in survey research. Panels A-C illustrate three typical missing data challenges in survey research. Each row indicates an individual subject in a social survey across different periods, and each column (i.e., X, Y, and Z) indicates public opinion variables that we aim to measure. The machine learning task in each situation is to predict the unobserved values [?] in the black cells using the observed values in the white cells.
Figure 2: An overview of our methodological framework. In Panel A, we use survey weights when aggregating individual-level prediction into population-level estimates to account for potential sampling bias. In Panel B, individual belief and period embeddings are initially randomly assigned but optimized during the fine-tuning process using dense and cross layers. Semantic embedding, initially estimated by pre-trained LLMs (e.g., Alpaca-7b), is also optimized during the fine-tuning stage.
Figure 3: Model performance for predicting three types of missing responses at individual and aggregate levels. Panel A displays the Receiver Operating Characteristic (ROC) curve, indicating how well a model can predict missing responses at an individual level. We also denote the AUC (Area Under Curve) values, i.e., the probability of the model ranking a randomly selected positive response over a randomly selected negative response. Panels B-D depict the relationship between the observed proportion of those who agree in a survey each year and the predicted proportion of the agreement for the same opinion. The percentage of correct predictions within a margin of error of 3% is indicated as "% Correct ± 3%," which implies that the difference between the actual and predicted opinions is 3% or less. We color the predictions that fall within this range.
Figure 4: Illustration of the potential application of our models and matrix factorization models for predicting counterfactual trends in the GSS 1972-2021. The generalized additive model was used to estimate the counterfactual trends. We count a prediction as correct when the prediction interval, within a 3% margin of error, includes the observed estimate. The variable name, response options, and question wording for each panel are as follows: Panels A1, B1. "What about sexual relations between two adults of the same sex--do you think it is always wrong (=1), almost always wrong (=1), wrong only sometimes (=1), or not wrong at all (=0)?" (homosex). Panels A2, B2. "Do you agree or disagree with the following statement? Homosexual couples have the right to marry one another. Strongly agree (=1), agree (=1), neither agree nor disagree (=0), disagree (=0), strongly disagree (=0)" (marhomo1). Panels A3, B3. "In general, do you favor or oppose the busing of (Negro/Black/African-American) and white school children from one school district to another? Favor (=1), Oppose (=0)" (busing). Panels A4, B4. "And how often do you refuse to eat meat for moral or environmental reasons? Always or Often (=1), Sometimes or Never (=0)" (nomeat).
Figure 5: Coefficient plots from OLS regression models predicting individual-level AUC across three different types of missing response prediction. A higher AUC value indicates greater model accuracy for individuals. Here, each dot represents the expected difference of AUC (i.e., average marginal effects) against the reference group within each subgroup with the 95% confidence intervals. Red bars indicate that the AUC for a particular group is below the AUC of the reference group, and blue bars indicate that the AUC for a particular group is above the AUC of the reference group. Here, a filled dot refers to a statistically significant difference, and an X refers to a statistically insignificant difference based on robust standard errors (p < 0.05).
...and 12 more figures
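
The captions for Figures 2 and 3 describe three evaluation ideas: AUC as a ranking probability, survey-weighted aggregation of individual predictions, and a "correct within ± 3%" check. The sketch below illustrates them on toy data. It is not the authors' code: the function names, the toy labels, scores, and weights are all hypothetical, and averaging predicted probabilities (rather than thresholded predictions) for the aggregate estimate is an assumption.

```python
from itertools import product


def pairwise_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs in which the positive
    response receives the higher score; ties count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p, n in product(pos, neg)
    )
    return wins / (len(pos) * len(neg))


def weighted_share(values, weights):
    """Survey-weighted mean, used here for both the observed and the
    predicted proportion agreeing."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def within_margin(observed, predicted, margin=0.03):
    """True when the predicted proportion is within `margin` of the observed one."""
    return abs(observed - predicted) <= margin


# Hypothetical toy data: binarized responses, model scores, survey weights.
labels = [1, 0, 1, 1, 0]
scores = [0.9, 0.4, 0.7, 0.3, 0.1]
weights = [1.2, 0.8, 1.0, 0.5, 1.5]

auc = pairwise_auc(labels, scores)            # individual-level ranking quality
observed = weighted_share(labels, weights)    # weighted observed proportion
predicted = weighted_share(scores, weights)   # weighted predicted proportion (assumption)

print(f"AUC = {auc:.3f}")
print(f"observed = {observed:.2f}, predicted = {predicted:.2f}")
print("within 3 points:", within_margin(observed, predicted))
```

The pairwise loop is quadratic in the number of responses, so it is only suitable for small illustrations; a rank-based routine would be the usual choice for data at the scale of the GSS.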
