Marked Personas: Using Natural Language Prompts to Measure Stereotypes in Language Models

Myra Cheng, Esin Durmus, Dan Jurafsky

TL;DR

Marked Personas introduces a lexicon-free, prompt-based framework, grounded in the concept of markedness, for measuring stereotypes in LLM outputs across intersectional groups. It has two components: generating natural-language personas, and extracting the words that distinguish them (Marked Words) using weighted log-odds ratios with an informative Dirichlet prior, with robustness checks based on an SVM classifier and Jensen-Shannon divergence. The study finds that GPT-3.5 and GPT-4 produce more stereotyped portrayals than human-written ones and surfaces patterns such as othering, essentialism, tropicalism, and resilience tropes, with implications for downstream story generation. The work advocates an intersectional, transparent approach to bias mitigation and acknowledges its limitations, including its cultural scope, pointing to directions for more comprehensive and accountable stereotype measurement in LLMs.

Abstract

To recognize and mitigate harms from large language models (LLMs), we need to understand the prevalence and nuances of stereotypes in LLM outputs. Toward this end, we present Marked Personas, a prompt-based method to measure stereotypes in LLMs for intersectional demographic groups without any lexicon or data labeling. Grounded in the sociolinguistic concept of markedness (which characterizes explicitly linguistically marked categories versus unmarked defaults), our proposed method is twofold: 1) prompting an LLM to generate personas, i.e., natural language descriptions, of the target demographic group alongside personas of unmarked, default groups; 2) identifying the words that significantly distinguish personas of the target group from corresponding unmarked ones. We find that the portrayals generated by GPT-3.5 and GPT-4 contain higher rates of racial stereotypes than human-written portrayals using the same prompts. The words distinguishing personas of marked (non-white, non-male) groups reflect patterns of othering and exoticizing these demographics. An intersectional lens further reveals tropes that dominate portrayals of marginalized groups, such as tropicalism and the hypersexualization of minoritized women. These representational harms have concerning implications for downstream applications like story generation.
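
The second step of the method (finding words that significantly distinguish target-group personas from unmarked ones) can be illustrated with a minimal sketch. It assumes the weighted log-odds ratio with an informative Dirichlet prior mentioned in the TL;DR, with the prior taken from the pooled word counts of both groups and a z-score cutoff of 1.96. The function names, the regex tokenizer, the choice of prior, the cutoff, and the toy personas are all assumptions made for illustration, not the authors' implementation.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into simple word tokens (a hypothetical tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def marked_words(target_texts, unmarked_texts, threshold=1.96):
    """Return {word: z-score} for words over-represented in target_texts.

    Uses weighted log-odds ratios with an informative Dirichlet prior.
    Assumption: the prior counts are the pooled counts of both groups.
    """
    target = Counter(w for t in target_texts for w in tokenize(t))
    unmarked = Counter(w for t in unmarked_texts for w in tokenize(t))
    prior = target + unmarked

    n_target = sum(target.values())
    n_unmarked = sum(unmarked.values())
    n_prior = sum(prior.values())

    marked = {}
    for word, alpha in prior.items():
        y_t, y_u = target[word], unmarked[word]
        # Smoothed log-odds of the word within each group.
        log_odds_t = math.log((y_t + alpha) / (n_target + n_prior - y_t - alpha))
        log_odds_u = math.log((y_u + alpha) / (n_unmarked + n_prior - y_u - alpha))
        # Approximate variance of the log-odds difference.
        variance = 1.0 / (y_t + alpha) + 1.0 / (y_u + alpha)
        z = (log_odds_t - log_odds_u) / math.sqrt(variance)
        if z > threshold:
            marked[word] = z
    return marked


# Toy personas, invented purely for illustration.
target_personas = ["She is strong and resilient."] * 50
unmarked_personas = ["He is kind and friendly."] * 50
print(sorted(marked_words(target_personas, unmarked_personas)))
```

On this toy input, words that appear only in the target personas exceed the cutoff, while words shared equally by both groups score zero and are excluded.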

Paper Structure (34 sections, 7 figures, 18 tables)

Figures (7)

  • Figure 1: Average percentage of words across personas that are in the Black and White stereotype lexicons. Error bar denotes standard error. Generated portrayals (blue) contain more stereotypes than human-written ones (green). For GPT-3.5, generated white personas contain more Black stereotype lexicon words than generated Black personas.
  • Figure 2: Percentage of personas that contain stereotype lexicon words. On the $x$-axis, lexicon words that do not occur in the generated personas (ghetto, unrefined, criminal, gangster, poor, unintelligent, uneducated, dangerous, vernacular, violent and lazy) are subsumed into "other words." Generated personas contain more Black-stereotypical words, but only the ones that are nonnegative in sentiment. For GPT-3.5, white personas have higher rates of stereotype lexicon words, thus motivating an unsupervised measure of stereotypes.
  • Figure 3: Percentage of personas that contain resilient and resilience. Occurrences of resilient and resilience across generated personas reveal that these terms are primarily used in descriptions of Black women and other women of color. Groups where these words occur in $<10\%$ of personas across models are subsumed into "other groups." We observe similar trends for other models (see the appendix).
  • Figure A1: Percentage of racial and ethnic stereotypes in portrayals of different groups. For Asian, White, and Middle-Eastern stereotypes, the corresponding portrayals exhibit the highest rates of those stereotypes. Rates of stereotypes are generally lower in text-davinci-003 portrayals than text-davinci-002 portrayals.
  • Figure A2: Average percentage of words across personas that are in the Black and White stereotype lexicons. Error bar denotes standard error. Portrayals by ChatGPT (blue) contain more stereotypes than human-written ones (green). As with GPT-3.5, the rates of Black stereotypical words are higher in the generated white personas than in the generated Black ones.
  • ...and 2 more figures
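
The lexicon-based quantity plotted in Figures 1 and A2 (average percentage of persona words that appear in a stereotype lexicon, with standard error) could be computed roughly as follows. This is a sketch under stated assumptions: the tokenizer, the function names, and the placeholder lexicon and personas are hypothetical, and the standard error uses the sample standard deviation divided by the square root of the number of personas.

```python
import math
import re


def lexicon_rate(persona, lexicon):
    """Percentage of tokens in one persona that appear in the lexicon."""
    tokens = re.findall(r"[a-z']+", persona.lower())
    if not tokens:
        return 0.0
    return 100.0 * sum(token in lexicon for token in tokens) / len(tokens)


def mean_and_stderr(values):
    """Mean and standard error (sample std / sqrt(n)) of a list of numbers."""
    n = len(values)
    mean = sum(values) / n
    if n < 2:
        return mean, 0.0
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, math.sqrt(variance / n)


# Placeholder lexicon and personas, invented for illustration only.
lexicon = {"alpha"}
personas = ["alpha beta gamma delta", "alpha alpha beta beta"]
rates = [lexicon_rate(p, lexicon) for p in personas]
mean, stderr = mean_and_stderr(rates)
print(f"mean={mean:.1f}%, stderr={stderr:.1f}")
```

Figure 2 reports a different quantity, the percentage of personas containing each lexicon word, which would need a per-word membership check across personas rather than a per-persona token rate.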