Using LLMs to Model the Beliefs and Preferences of Targeted Populations

Keiichi Namikoshi, Alex Filipowicz, David A. Shamma, Rumen Iliev, Candice L. Hogan, Nikos Arechiga

TL;DR

The paper addresses aligning LLMs to model the beliefs and preferences of a real human population for virtual surveys and interventions. It proposes prompting-based virtual participants and parameter-efficient fine-tuning (LoRA/QLoRA) on the EV-shift BEV survey, augmented by a novel numerical penalty term and analyzed across model size, quantization, and sampling temperature. Key findings show larger models improve population-wide and individual-fit metrics before fine-tuning, while fine-tuning narrows gains; QLoRA offers efficient fine-tuning with minimal performance loss; calibrated sampling reduces population-level divergence while greedy sampling improves per-individual RMSE, and the numerical penalty term enhances accuracy on numeric questions. The work demonstrates practical viability for population-level simulations and interactive belief modeling, with implications for marketing, intervention design, and policy testing in settings where real-world studies are costly or unethical.

Abstract

We consider the problem of aligning a large language model (LLM) to model the preferences of a human population. Modeling the beliefs, preferences, and behaviors of a specific population can be useful for a variety of different applications, such as conducting simulated focus groups for new products, conducting virtual surveys, and testing behavioral interventions, especially for interventions that are expensive, impractical, or unethical. Existing work has had mixed success using LLMs to accurately model human behavior in different contexts. We benchmark and evaluate two well-known fine-tuning approaches and evaluate the resulting populations on their ability to match the preferences of real human respondents on a survey of preferences for battery electric vehicles (BEVs). We evaluate our models against their ability to match population-wide statistics as well as their ability to match individual responses, and we investigate the role of temperature in controlling the trade-offs between these two. Additionally, we propose and evaluate a novel loss term to improve model performance on responses that require a numeric response.

Paper Structure (18 sections, 2 equations, 11 figures, 10 tables)

Figures (11)

  • Figure 1: Convert survey data to prompt text. Left: Initial preference questionnaire. Right: Post-intervention preference questionnaire.
  • Figure 2: Benchmark results. Line plots indicate the baseline points for each hyper-parameter. The left two plots show greedy sampling results ($t=0$). With greedy sampling, the fine-tuned models all outperform the pre-trained models. The largest model attains the best KL-divergence, but not the best RMSE score. None of the models outperform the supervised learning baselines on either RMSE or KL-divergence. The right two plots show calibrated sampling ($t=1$). All models outperform the baselines on KL-divergence, but not on RMSE. The square boxes overlap because the difference in fine-tuned model performance is small. The values for each fine-tuned model are listed in Table \ref{table:prformance_qlora_t1}.
  • Figure 3: Sampling temperature effects. 7B+QLoRA. A temperature of 0.0 corresponds to greedy sampling, and a temperature of 1.0 corresponds to calibrated sampling. Varying the temperature allows trading off the population-wide statistical metric of KL-divergence against the per-individual RMSE metric.
  • Figure 4: Numerical penalty term effects. 7B+QLoRA. The coefficient $\alpha$ of the penalty term is fixed at 0.5. The penalty term lowers RMSE, and the decrease tends to be largest when $d=10$.
  • Figure 5: Learning curve (the dashed line indicates the 1-epoch position).
  • ...and 6 more figures
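
The temperature trade-off described in the Figure 3 caption can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the paper's evaluation code: the synthetic 5-option responses, the idealized "perfectly calibrated" per-individual answer distributions, the KL direction (true population against predicted population), and the helper names `sample_answers`, `kl_divergence`, and `rmse` are all hypothetical choices made for illustration.

```python
import numpy as np

def sample_answers(probs, t, rng):
    """Pick one option index per individual at sampling temperature t.

    t == 0 is greedy (argmax); t == 1 samples from probs unchanged.
    """
    if t == 0:
        return probs.argmax(axis=1)
    scaled = probs ** (1.0 / t)                      # temperature scaling
    scaled /= scaled.sum(axis=1, keepdims=True)      # renormalize each row
    k = probs.shape[1]
    return np.array([rng.choice(k, p=row) for row in scaled])

def kl_divergence(true_answers, pred_answers, k, eps=1e-9):
    """KL(true population distribution || predicted population distribution)."""
    p = np.bincount(true_answers, minlength=k) / len(true_answers) + eps
    q = np.bincount(pred_answers, minlength=k) / len(pred_answers) + eps
    p, q = p / p.sum(), q / q.sum()                  # smooth to avoid log(0)
    return float(np.sum(p * np.log(p / q)))

def rmse(true_answers, pred_answers):
    """Per-individual root-mean-square error between answers."""
    diff = true_answers - pred_answers
    return float(np.sqrt(np.mean(diff ** 2)))

rng = np.random.default_rng(0)
n, k = 2000, 5                                       # toy population, 5 options

# Assumption: each individual has a peaked answer distribution around a
# personal "center" option, and the model knows that distribution exactly.
centers = rng.choice([1, 2, 3], size=n, p=[0.2, 0.6, 0.2])
options = np.arange(k)
probs = np.exp(-np.abs(options[None, :] - centers[:, None]).astype(float))
probs /= probs.sum(axis=1, keepdims=True)

# "Real" responses are one draw from each individual's distribution.
true_answers = sample_answers(probs, 1.0, rng)

results = {}
for t in (0.0, 0.5, 1.0):
    pred = sample_answers(probs, t, rng)
    results[t] = (kl_divergence(true_answers, pred, k), rmse(true_answers, pred))
    print(f"t={t:.1f}  KL={results[t][0]:.4f}  RMSE={results[t][1]:.4f}")
```

In this toy setting, greedy sampling collapses every individual onto their most likely answer, which keeps per-individual error low but drops the tails of the population distribution. Sampling at $t=1$ restores the tails, lowering KL-divergence at the cost of higher RMSE.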