Survey Transfer Learning: Recycling Data with Silicon Responses

Ali Amini

TL;DR

The paper tackles the environmental and methodological drawbacks of using large language models to generate synthetic survey data. It proposes Survey Transfer Learning (STL), which reuses gold-standard survey data (CES and ANES) and transfers learned demographic–partisan structure via Anchor Transfer Variables in a three-stage backbone–head neural network to produce empirically grounded silicon responses. STL achieves strong cross-survey performance, e.g., $\mathrm{AUC} \approx 0.97$ for vote prediction and distributional fidelity with a KS statistic $< 0.03$ and a Wasserstein distance $< 0.03$, outperforming LLM-generated data and traditional imputation on sensitive measures like racial resentment. The approach offers a sustainable, transparent alternative for missing data imputation and cross-survey augmentation, enabling reproducible research while reducing environmental impact. It also frames surveys as interconnected data resources, paving the way for broader cross-survey integration and methodological innovations in political science.

Abstract

As researchers increasingly turn to large language models (LLMs) to generate synthetic survey data, less attention has been paid to alternative AI paradigms, given the environmental costs of LLMs. This paper introduces Survey Transfer Learning (STL), which adapts transfer learning paradigms from computer science to survey research in order to recycle existing survey data and generate empirically grounded silicon responses. Inspired by political behavior theory, STL leverages shared demographic variables with high predictive power in a polarized American context to transfer knowledge across surveys. Using a neural network pre-trained on the Cooperative Election Study (CES) 2020, freezing early layers to preserve learned structure, and fine-tuning top layers on the American National Election Studies (ANES) 2020, STL generates silicon responses for CES 2022 and for held-out ANES 2020 data with accuracy rates of up to 93 percent. Results show that STL outperforms LLMs, especially on sensitive measures such as racial resentment. While LLM silicon samples are costly and opaque, STL generates empirically grounded silicon responses with high individual-level accuracy, potentially helping to mitigate key challenges in social science and the polling industry.

Paper Structure (5 sections, 11 equations, 11 figures, 8 tables)

This paper contains 5 sections, 11 equations, 11 figures, 8 tables.

Figures (11)

  • Figure 1: Survey Transfer Learning (STL) Framework for Cross-Survey Domain Adaptation. The STL approach reuses knowledge from a source survey (CES 2020) to predict the same outcome in a target survey (ANES 2020). Both datasets share Anchor Transfer Variables (ATVs)—demographic and ideological features such as age, education, gender, income, party identification (PID), race, and region. In Stage 1, the backbone and head are trained on the source survey to predict a known policy outcome ($Y_1$). In Stage 2, the backbone’s earlier layers (orange) are frozen to preserve CES-learned structure, while later backbone layers (peach) and the task-specific head (blue) are fine-tuned on the target survey. The fine-tuned model then generates synthetic responses ($\hat{Y}_1$) for the target dataset, enabling high-accuracy predictions without re-collecting all outcome data. This design ensures that knowledge transfer occurs in the backbone, while the head remains flexible and task-specific.
  • Figure 2: Three-stage Survey Transfer Learning (STL) Framework. All stages share a common backbone $f_{\theta}$, which encodes demographic and ideological representations (Anchor Transfer Variables, ATVs). Stage 1: The model is pre-trained on CES 2020 (Task 1: $Y_1$) using a backbone $f_{\theta}$ and task-specific head $h_1$. Stage 2: The same backbone $f_{\theta}$ is transferred by freezing its early layers to retain useful representations learned from CES data, while making the later layers and a new task-specific head $h_2$ trainable and fine-tuning them on ANES 2020 (Task 2: $Y_2$), yielding an updated backbone $f_{\theta}^{\star}$. Stage 3: The fine-tuned backbone $f_{\theta}^{\star}$ and head $h_2$ are reused to generate empirically grounded silicon responses $\hat{Y}_2$ for CES 2022 or any new dataset with shared features. This framework illustrates how knowledge can be transferred across surveys through a shared representation while adapting to new data distributions.
  • Figure 3: Model performance on vote choice prediction. (a) Aggregate Trump vote predictions compared with actual outcomes. (b) Confusion matrix showing class-level performance.
  • Figure 4: Comparison of a biological neuron and an artificial neuron used in deep learning.
  • Figure 5: The confusion matrix.
  • ...and 6 more figures
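
The three stages described in the captions for Figures 1 and 2 (pre-train on a source survey, freeze the early backbone layers while fine-tuning the later layers and a new head on a target survey, then generate silicon responses for new respondents) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the data are synthetic, and the layer sizes, learning rate, and every name (`toy_survey`, `train`, `pretrained`, `tuned`, `silicon`) are hypothetical choices made for the example.

```python
import copy
import math
import random

def init_matrix(n_out, n_in, rng):
    """Small random weight matrix with n_out rows and n_in columns."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def forward(params, x):
    """Backbone (early and late tanh layers) followed by a sigmoid head."""
    a1 = [math.tanh(z) for z in matvec(params["early"], x)]
    a2 = [math.tanh(z) for z in matvec(params["late"], a1)]
    z = sum(w * a for w, a in zip(params["head"], a2))
    return a1, a2, 1.0 / (1.0 + math.exp(-z))

def train(params, data, trainable, lr=0.1, epochs=30):
    """SGD on log loss; only the layers named in `trainable` are updated."""
    for _ in range(epochs):
        for x, y in data:
            a1, a2, p = forward(params, x)
            dz = p - y  # gradient of log loss w.r.t. the head pre-activation
            d_z2 = [dz * w * (1 - a * a) for w, a in zip(params["head"], a2)]
            d_a1 = [sum(d_z2[j] * params["late"][j][i] for j in range(len(a2)))
                    for i in range(len(a1))]
            d_z1 = [g * (1 - a * a) for g, a in zip(d_a1, a1)]
            if "head" in trainable:
                params["head"] = [w - lr * dz * a
                                  for w, a in zip(params["head"], a2)]
            if "late" in trainable:
                params["late"] = [[w - lr * d_z2[j] * a1[i] for i, w in enumerate(row)]
                                  for j, row in enumerate(params["late"])]
            if "early" in trainable:
                params["early"] = [[w - lr * d_z1[j] * x[i] for i, w in enumerate(row)]
                                   for j, row in enumerate(params["early"])]
    return params

def toy_survey(n, age_weight, rng):
    """Synthetic stand-in for a survey: features = [bias, party id, scaled age]."""
    rows = []
    for _ in range(n):
        x = [1.0, rng.choice([-1.0, 1.0]), rng.uniform(-1.0, 1.0)]
        score = 2.0 * x[1] + age_weight * x[2] + rng.gauss(0.0, 0.5)
        rows.append((x, 1.0 if score > 0 else 0.0))
    return rows

rng = random.Random(0)
source = toy_survey(300, 0.5, rng)  # plays the role of the source survey
target = toy_survey(100, 1.5, rng)  # plays the role of the target survey

# Stage 1: pre-train the backbone and head h1 on the source survey.
pretrained = {"early": init_matrix(4, 3, rng),
              "late": init_matrix(3, 4, rng),
              "head": [rng.uniform(-0.5, 0.5) for _ in range(3)]}
train(pretrained, source, trainable={"early", "late", "head"})

# Stage 2: freeze the early layer; fine-tune the late layer and a new head h2.
tuned = copy.deepcopy(pretrained)
tuned["head"] = [rng.uniform(-0.5, 0.5) for _ in range(3)]
train(tuned, target, trainable={"late", "head"})

# Stage 3: generate silicon responses (predicted probabilities) for new rows.
new_respondents = [x for x, _ in toy_survey(50, 1.5, rng)]
silicon = [forward(tuned, x)[2] for x in new_respondents]
```

Because the early layer is excluded from `trainable` in Stage 2, its weights are identical in `pretrained` and `tuned`, which is what "freezing" means here; only the late backbone layer and the new head adapt to the target data.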