Large language models that replace human participants can harmfully misportray and flatten identity groups

Angelina Wang, Jamie Morgenstern, John P. Dickerson

TL;DR

Caution is urged in use cases where LLMs are intended to replace human participants whose identities are relevant to the task at hand. In cases where the benefits of LLM replacement are determined to outweigh the harms, inference-time techniques are provided that reduce, but do not remove, these harms.

Abstract

Large language models (LLMs) are increasing in capability and popularity, propelling their application in new domains -- including as replacements for human participants in computational social science, user testing, annotation tasks, and more. In many settings, researchers seek to distribute their surveys to a sample of participants that are representative of the underlying human population of interest. This means in order to be a suitable replacement, LLMs will need to be able to capture the influence of positionality (i.e., relevance of social identities like gender and race). However, we show that there are two inherent limitations in the way current LLMs are trained that prevent this. We argue analytically for why LLMs are likely to both misportray and flatten the representations of demographic groups, then empirically show this on 4 LLMs through a series of human studies with 3200 participants across 16 demographic identities. We also discuss a third limitation about how identity prompts can essentialize identities. Throughout, we connect each limitation to a pernicious history of epistemic injustice against the value of lived experiences that explains why replacement is harmful for marginalized demographic groups. Overall, we urge caution in use cases where LLMs are intended to replace human participants whose identities are relevant to the task at hand. At the same time, in cases where the benefits of LLM replacement are determined to outweigh the harms (e.g., the goal is to supplement rather than fully replace, engaging human participants may cause them harm), we provide inference-time techniques that we empirically demonstrate do reduce, but do not remove, these harms.

Paper Structure (7 sections, 19 figures, 2 tables)

Figures (19)

  • Figure 1: Summary. We consider four possible reasons for prompting an LLM with a demographic identity: when the answer is contingent on identity membership, when identity is relevant to the answer, when the answer is subjective in a way where identity might play a role, and when identity is intended to increase response coverage. We then consider three problems with identity-prompting LLMs. For each, we describe where the inherent limitation arises from, the variety of measurements we use to capture the phenomenon in our analysis, a concrete alternative we recommend if identity-prompting is deemed permissible, and an explanation of the reason for harm.
  • Figure 2: LLMs compared to out-group imitations and in-group portrayals. Across three sets of reasons (columns), each point indicates the t-statistic of GPT-4's similarity to out-group imitations (positive value) compared to in-group portrayals (negative value) for one question across 100 samples. Columns have different numbers of questions (e.g., two per R2-Relevant and three per R3-Subjective). Each color indicates a different axis of identity, and the x-axis is the t-statistic. Circles indicate statistical significance with $p<.05$ and crosses indicate otherwise. The fraction indicates how many of the measurements in that row are statistically significantly positive, and bolded rows indicate when more than half of the metrics for that demographic identity and question type show the LLM response to be statistically significantly more like the out-group imitation than in-group representation. Overall we see that on R1-Contingent and R2-Relevant, identities like non-binary person and person with impaired vision are consistently more like out-group imitations. R3-Subjective shows smaller effects.
  • Figure 3: Identity-coded names compared to explicit identity label. Same interpretation as Fig. 2: the t-statistic is shown, and positive values for each of the six metrics indicate the LLM response is more similar to out-group imitations than in-group representations. All shown values are statistically significant. Squares indicate that the explicit identity label is prompted (Ident), and circles indicate one of the two identity-coded names (Name 0 or Name 1). Identity-coded names tend to generate more in-group-aligned portrayals than do explicit identity labels, as shown by more negative values.
  • Figure 4: LLMs flatten groups. Across all four LLMs (rows), each point indicates the diversity measurement averaged across 3-6 questions asked for each identity. There are 100 samples per question, and 95% confidence intervals are generated through cluster bootstrapping with each question as a cluster. Each column represents a different measure of diversity, and the larger the number on the x-axis, the more diverse the responses are. The gray crosses indicate human participant in-group responses, while colored circles represent LLM responses. Nearly every single model and identity group across each metric has less diverse LLM responses compared to human responses.
  • Figure 5: Temperature hyperparameter does not solve flatness for GPT-4. Comparison of human in-group diversity to GPT-4 generations at varying temperature settings; by a temperature of 1.4 the responses become incoherent. There are 100 responses at each setting, and 95% confidence intervals are shown. At a temperature of 1.4 the unique n-gram metric shows GPT-4 surpassing humans in diversity, but this is only due to the incoherence: under no other semantic metric is human diversity reached.
  • ...and 14 more figures
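
The captions for Figures 4 and 5 mention a unique n-gram diversity metric and 95% confidence intervals from cluster bootstrapping with each question as a cluster. The following is a minimal sketch of how such a computation could look, not the authors' implementation. The function names, the whitespace tokenization, the choice of bigrams, and the one-stage resampling scheme (whole questions are resampled with replacement, and all responses within a sampled question are kept) are assumptions made for illustration.

```python
import random
from statistics import mean


def unique_ngram_fraction(responses, n=2):
    """Fraction of n-grams across all responses that are distinct.

    Higher values mean more lexically diverse responses.
    Assumption: simple lowercase whitespace tokenization.
    """
    ngrams = []
    for text in responses:
        tokens = text.lower().split()
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0


def cluster_bootstrap_ci(responses_by_question, metric, n_boot=1000, alpha=0.05, seed=0):
    """Percentile confidence interval for the metric averaged over questions.

    Each question is one cluster: clusters are resampled with replacement,
    and the metric is recomputed on every response inside a sampled cluster.
    Returns (point_estimate, lower_bound, upper_bound).
    """
    rng = random.Random(seed)
    questions = list(responses_by_question)
    # Compute the per-question metric once; resampling only changes the mix of questions.
    per_question = {q: metric(responses_by_question[q]) for q in questions}

    estimates = []
    for _ in range(n_boot):
        sampled = [rng.choice(questions) for _ in questions]
        estimates.append(mean(per_question[q] for q in sampled))
    estimates.sort()

    lower = estimates[int((alpha / 2) * n_boot)]
    upper = estimates[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    point = mean(per_question[q] for q in questions)
    return point, lower, upper


# Hypothetical toy data: three questions, two responses each.
toy = {
    "q1": ["a b c", "a b c"],
    "q2": ["a b", "c d"],
    "q3": ["a b c", "a b d"],
}
point, lower, upper = cluster_bootstrap_ci(toy, unique_ngram_fraction)
```

Resampling at the question level rather than the response level reflects that responses to the same question are not independent, so the interval is driven by variation between questions. With only 3-6 questions per identity, as in Figure 4, such intervals will be coarse.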