Adaptive political surveys and GPT-4: Tackling the cold start problem with simulated user interactions

Fynn Bachmann, Daan van der Weijden, Lucien Heitz, Cristina Sarasua, Abraham Bernstein

TL;DR

This work finds that an off-the-shelf LLM (GPT-4) accurately generates answers to the Smartvote questionnaire from the perspective of different Swiss parties and demonstrates that initialising the statistical model with synthetic data can reduce the error in predicting user responses and increase the candidate recommendation accuracy of the VAA.

Abstract

Adaptive questionnaires dynamically select the next question for a survey participant based on their previous answers. Due to digitalisation, they have become a viable alternative to traditional surveys in application areas such as political science. One limitation, however, is their dependency on data to train the model for question selection. Often, such training data (i.e., user interactions) are unavailable a priori. To address this problem, we (i) test whether Large Language Models (LLMs) can accurately generate such interaction data and (ii) explore whether these synthetic data can be used to pre-train the statistical model of an adaptive political survey. To evaluate this approach, we utilise existing data from the Swiss Voting Advice Application (VAA) Smartvote in two ways: First, we compare the distribution of LLM-generated synthetic data to the real distribution to assess their similarity. Second, we compare the performance of a randomly initialised adaptive questionnaire with that of one pre-trained on synthetic data to assess the suitability of the synthetic data for training. We benchmark these results against an "oracle" questionnaire with perfect prior knowledge. We find that an off-the-shelf LLM (GPT-4) accurately generates answers to the Smartvote questionnaire from the perspective of different Swiss parties. Furthermore, we demonstrate that initialising the statistical model with synthetic data can (i) significantly reduce the error in predicting user responses and (ii) increase the candidate recommendation accuracy of the VAA. Our work emphasises the considerable potential of LLMs to create training data to improve the data collection process in adaptive questionnaires in LLM-affine areas such as political surveys.
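
The select-then-impute loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' statistical model: it assumes, purely for illustration, a multivariate Gaussian over answers, asks the question with the largest remaining predictive variance, and imputes unanswered questions with the conditional mean. The names `fit_prior`, `condition`, and `run_adaptive_survey` and all toy data are hypothetical. In this picture, pre-training roughly corresponds to fitting the prior on synthetic responses before the first real user arrives.

```python
import numpy as np

def fit_prior(responses, ridge=1e-3):
    """Estimate mean and covariance from a (users x questions) training matrix."""
    mu = responses.mean(axis=0)
    # A small ridge term keeps the covariance invertible (assumption for stability).
    cov = np.cov(responses, rowvar=False) + ridge * np.eye(responses.shape[1])
    return mu, cov

def condition(mu, cov, asked, answers):
    """Gaussian conditioning: predict the unasked questions from the asked ones."""
    asked = np.array(asked, dtype=int)
    rest = np.setdiff1d(np.arange(len(mu)), asked)
    if asked.size == 0:
        return rest, mu[rest], cov[np.ix_(rest, rest)]
    gain = cov[np.ix_(rest, asked)] @ np.linalg.inv(cov[np.ix_(asked, asked)])
    cond_mu = mu[rest] + gain @ (np.asarray(answers, dtype=float) - mu[asked])
    cond_cov = cov[np.ix_(rest, rest)] - gain @ cov[np.ix_(asked, rest)]
    return rest, cond_mu, cond_cov

def run_adaptive_survey(mu, cov, true_answers, k):
    """Ask k questions adaptively, then impute the rest (simulating a drop-out)."""
    asked, answers = [], []
    for _ in range(k):
        rest, _, cond_cov = condition(mu, cov, asked, answers)
        # Illustrative selection rule: ask the most uncertain remaining question.
        nxt = int(rest[np.argmax(np.diag(cond_cov))])
        asked.append(nxt)
        answers.append(float(true_answers[nxt]))
    rest, cond_mu, _ = condition(mu, cov, asked, answers)
    imputed = np.array(true_answers, dtype=float)
    imputed[rest] = cond_mu
    return asked, imputed

# Toy data (made up): one latent dimension drives six questions.
rng = np.random.default_rng(0)
latent = rng.normal(size=(201, 1))
loadings = np.array([[1.0, 0.8, -0.9, 0.5, -0.6, 0.7]])
data = latent @ loadings + 0.3 * rng.normal(size=(201, 6))
synthetic, user = data[:200], data[200]

mu, cov = fit_prior(synthetic)  # stands in for "pre-training" on synthetic data
asked, imputed = run_adaptive_survey(mu, cov, user, k=3)
rmse = float(np.sqrt(np.mean((imputed - user) ** 2)))
```

A cold start would correspond to calling `run_adaptive_survey` with an uninformative prior (for example a zero mean and identity covariance), in which case the first questions cannot be chosen informatively and the imputations fall back to the prior mean.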

Paper Structure

This paper contains 36 sections, 4 equations, 15 figures, 5 tables.

Figures (15)

  • Figure 1: Schematic overview. Users interact with an adaptive questionnaire. The statistical model sequentially selects the next question for each user. After each user response, the model is updated. When users drop out, their remaining answers are imputed by the model's predictions. An LLM is used to generate training data for the model's initialisation.
  • Figure 2: Latent space of the statistical model fitted to the candidates' dataset. A: The decision boundary for the logistic regression of the question "Do you support the increase of the retirement age (e.g., to 67)?" is shown. The colours of the candidates represent their respective agreement with this question. B: Based on the candidate's responses and the likelihoods of the questions, the resulting posterior distribution is shown for the liberal FDP candidate No. 9 (indicated by the black arrow). The other candidates are coloured by their party membership.
  • Figure 3: Data generation results with GPT-4. A: The PCA projection of the candidates (orange dots) shows their distribution in a two-dimensional space. The GPTvoters dataset (blue dots), constructed as linear combinations of the party vertices (coloured triangles), is projected onto the same axes. The clusters of party-coloured circles correspond to the GPT dataset. B: In the same two-dimensional space, GPTmeans (triangles) are compared to the real party-means (circles). The dashed ellipses represent the 1-$\sigma$ confidence interval of the party-means. The individual candidates are coloured by their party membership.
  • Figure 4: GPT samples compared to candidates' responses. For each question, the mean and standard deviation of the candidates of the respective party are shown by the blue dots and horizontal error bars. The means and standard deviations of the GPT samples are shown in orange. The question "Should direct payments only be granted to farmers with proof of ecological performance?" is highlighted by a black circle.
  • Figure 5: Simulation results with different training data and $K=30$. A: For the downstream task of imputing the missing values, the RMSE quickly converges to the benchmark (the model trained with the candidates' dataset). The blue line shows the RMSE of imputing the remaining questions in the cold start setting. The other lines correspond to the model performance after initialisation with different variations of GPT-4 generated data. The vertical lines indicate the number of users at which Coldstart and GPTvoters intersect (here, after 175 users). B: For the downstream task of recommending the nearest candidates, the CRA slowly approaches the benchmark. The blue line shows the CRA in the cold start setting. The other lines correspond to the model performance after initialisation with different variations of the GPT-4 generated data. The vertical lines indicate the break-even point, where Coldstart and GPTvoters intersect (here, after 485 users).
  • ...and 10 more figures
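
The Figure 3 caption describes the GPTvoters dataset as linear combinations of the party vertices. A minimal sketch of that idea follows, under loud assumptions: the captions do not say how the mixing weights are chosen, so this example draws convex weights from a Dirichlet distribution, and the function name `sample_voters`, the `concentration` parameter, and the party answer vectors are all hypothetical.

```python
import numpy as np

def sample_voters(party_vertices, n_voters, concentration=0.5, seed=0):
    """Draw synthetic voters as convex combinations of party answer vectors.

    party_vertices: (parties x questions) array, one answer vector per party.
    The Dirichlet weighting is an illustrative assumption, not the paper's recipe.
    """
    rng = np.random.default_rng(seed)
    n_parties = party_vertices.shape[0]
    # Each row of weights is non-negative and sums to 1.
    weights = rng.dirichlet(np.full(n_parties, concentration), size=n_voters)
    return weights, weights @ party_vertices

# Made-up answer vectors for three parties on four questions (scaled to [0, 1]).
party_vertices = np.array([
    [1.0, 0.0, 0.75, 0.25],
    [0.0, 1.0, 0.25, 0.75],
    [0.5, 0.5, 1.0, 0.0],
])
weights, voters = sample_voters(party_vertices, n_voters=100)
```

Because the weights are convex, every synthetic voter lies inside the region spanned by the party vertices. A small `concentration` places most voters near a single party, while larger values pull them towards the centre.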