Assessing Generalization for Subpopulation Representative Modeling via In-Context Learning

Gabriel Simmons, Vladislav Savinov

TL;DR

The paper investigates how well LLM-based Subpopulation Representative Models (SRMs) generalize beyond the conditioning data used in in-context learning, focusing on generalization across response variables and demographics using ANES data. It formalizes fidelity with $E(d, V_c, D_{fs})$ and $E(d, n_c, n_{fs})$, and evaluates performance using prompting with gpt-3.5-turbo under zero-shot and few-shot conditions. The results show that fidelity generally improves with more conditioning variables and more few-shot examples, but the degree of improvement is highly heterogeneous across demographic groups, with some groups benefitting little or even experiencing worse fidelity. These findings highlight ethical and practical concerns for deploying SRMs, underscoring the need for fine-grained benchmarks and strategies to ensure equitable generalization across subpopulations in political and social science contexts.

Abstract

This study evaluates the ability of Large Language Model (LLM)-based Subpopulation Representative Models (SRMs) to generalize from empirical data, utilizing in-context learning with data from the 2016 and 2020 American National Election Studies. We explore generalization across response variables and demographic subgroups. While conditioning with empirical data improves performance on the whole, the benefit of in-context learning varies considerably across demographics, sometimes hurting performance for one demographic while helping performance for others. The inequitable benefits of in-context learning for SRM present a challenge for practitioners implementing SRMs, and for decision-makers who might come to rely on them. Our work highlights a need for fine-grained benchmarks captured from diverse subpopulations that test not only fidelity but generalization.

Paper Structure (21 sections, 2 equations, 25 figures, 1 table)

Figures (25)

  • Figure 1: Description of the prompting strategy used for both RQ 1 and RQ 2. For Study 1, $|D_{fs}| = 0$.
  • Figure 2: Changes in the fidelity error ($E$) as a function of the number of conditioning variables ($|V_{bc}|$), averaged across all demographics. The fidelity error decreases as the number of conditioning variables increases. This pattern holds for every number of few-shot examples checked.
  • Figure 3: Changes in the fidelity error ($E$) depending on the number of conditioning variables ($|V_{bc}|$) for different racial groups. Error rates are lower in general for non-Hispanic Whites than for other racial groups.
  • Figure 4: Changes in the fidelity error ($E$) depending on the number of conditioning variables ($|V_{bc}|$) for different political parties. Error rates are lower in general for Democrats than for Republicans.
  • Figure 5: Changes in the fidelity error ($E$) depending on the number of few-shot examples ($|D_{fs}|$) for different racial groups. Error rates are lower for non-Hispanic Whites. As the number of few-shot examples increases, the fidelity error for the other racial groups remains nearly constant, while the fidelity error for the non-Hispanic White group decreases.
  • ...and 20 more figures
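
The zero-shot and few-shot setup referred to in Figure 1, and the idea of a fidelity error $E(d, n_c, n_{fs})$, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the prompt layout, the function names `build_prompt`, `answer_distribution`, and `fidelity_error`, and the use of total variation distance as the error are not taken from the paper, whose exact template and definition of $E$ are not given in this summary. The example values are made up and are not ANES data.

```python
from collections import Counter


def build_prompt(conditioning, few_shot, question):
    """Assemble an in-context prompt (hypothetical layout, not the paper's template).

    conditioning: dict mapping variable name -> value for the simulated respondent
    few_shot:     list of (profile dict, answer) pairs; an empty list means zero-shot
    question:     text of the response variable being asked about
    """
    def describe(profile):
        return "; ".join(f"{name}: {value}" for name, value in profile.items())

    blocks = []
    # Each few-shot example is a fully answered respondent.
    for profile, answer in few_shot:
        blocks.append(
            f"Respondent: {describe(profile)}\nQuestion: {question}\nAnswer: {answer}"
        )
    # The target respondent is left unanswered for the model to complete.
    blocks.append(
        f"Respondent: {describe(conditioning)}\nQuestion: {question}\nAnswer:"
    )
    return "\n\n".join(blocks)


def answer_distribution(answers):
    """Turn a list of sampled answers into a probability distribution."""
    counts = Counter(answers)
    total = sum(counts.values())
    return {option: n / total for option, n in counts.items()}


def fidelity_error(simulated, empirical):
    """Total variation distance between two answer distributions.

    This is a stand-in for E chosen for illustration, not the paper's metric.
    """
    options = set(simulated) | set(empirical)
    return 0.5 * sum(
        abs(simulated.get(o, 0.0) - empirical.get(o, 0.0)) for o in options
    )


if __name__ == "__main__":
    # Illustrative values only.
    target = {"party": "Democrat", "age": "45"}
    shots = [({"party": "Republican", "age": "60"}, "yes")]
    print(build_prompt(target, shots, "Did you vote in the last election?"))

    simulated = answer_distribution(["yes", "yes", "no", "yes"])
    empirical = {"yes": 0.5, "no": 0.5}
    print(fidelity_error(simulated, empirical))  # 0.25
```

Under these assumptions, passing an empty `few_shot` list gives the zero-shot condition ($|D_{fs}| = 0$), and adding or removing keys in `conditioning` changes the number of conditioning variables. Computing the error separately per demographic group is what would expose the uneven gains the paper reports.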