Assessing Generalization for Subpopulation Representative Modeling via In-Context Learning
Gabriel Simmons, Vladislav Savinov
TL;DR
The paper investigates how well LLM-based Subpopulation Representative Models (SRMs) generalize beyond the conditioning data used in in-context learning, focusing on generalization across response variables and demographic subgroups using ANES data. It formalizes fidelity as $E(d, V_c, D_{fs})$ and $E(d, n_c, n_{fs})$, and evaluates it by prompting gpt-3.5-turbo under zero-shot and few-shot conditions. The results show that fidelity generally improves with more conditioning variables and more few-shot examples, but the degree of improvement is highly heterogeneous across demographic groups: some groups benefit little, and some see worse fidelity. These findings raise ethical and practical concerns for deploying SRMs, underscoring the need for fine-grained benchmarks and strategies to ensure equitable generalization across subpopulations in political and social science contexts.
Abstract
This study evaluates the ability of Large Language Model (LLM)-based Subpopulation Representative Models (SRMs) to generalize from empirical data, using in-context learning with data from the 2016 and 2020 American National Election Studies. We explore generalization across response variables and demographic subgroups. While conditioning with empirical data improves performance on the whole, the benefit of in-context learning varies considerably across demographics, sometimes hurting performance for one demographic while helping performance for others. The inequitable benefits of in-context learning for SRMs present a challenge for practitioners implementing SRMs, and for decision-makers who might come to rely on them. Our work highlights a need for fine-grained benchmarks captured from diverse subpopulations that test not only fidelity but also generalization.
