Dialect prejudice predicts AI decisions about people's character, employability, and criminality

Valentin Hofmann, Pratyusha Ria Kalluri, Dan Jurafsky, Sharese King

TL;DR

The paper reveals that language models harbor covert racism in the form of dialect prejudice toward African American English (AAE). This prejudice surfaces as raciolinguistic stereotypes that are activated by dialect features rather than by overt mentions of race. Using Matched Guise Probing across multiple models and tasks, the study shows two things. First, the models' covert stereotypes align most closely with archaic human stereotypes recorded before the civil rights movement. Second, these stereotypes can drive detrimental judgments in hypothetical employment and criminal-justice scenarios. Importantly, standard bias-mitigation approaches such as scaling and human-feedback training fail to eliminate this covert prejudice and may even widen the gap between covert and overt stereotypes. Given the real-world harms that dialect prejudice in AI systems could cause, the work highlights a pressing need to rethink fairness and safety in language technology beyond traditional overt-bias mitigation.

Abstract

Hundreds of millions of people now interact with language models, with uses ranging from serving as a writing aid to informing hiring decisions. Yet these language models are known to perpetuate systematic racial prejudices, making their judgments biased in problematic ways about groups like African Americans. While prior research has focused on overt racism in language models, social scientists have argued that racism with a more subtle character has developed over time. It is unknown whether this covert racism manifests in language models. Here, we demonstrate that language models embody covert racism in the form of dialect prejudice: we extend research showing that Americans hold raciolinguistic stereotypes about speakers of African American English and find that language models have the same prejudice, exhibiting covert stereotypes that are more negative than any human stereotypes about African Americans ever experimentally recorded, although closest to the ones from before the civil rights movement. By contrast, the language models' overt stereotypes about African Americans are much more positive. We demonstrate that dialect prejudice has the potential for harmful consequences by asking language models to make hypothetical decisions about people, based only on how they speak. Language models are more likely to suggest that speakers of African American English be assigned less prestigious jobs, be convicted of crimes, and be sentenced to death. Finally, we show that existing methods for alleviating racial bias in language models such as human feedback training do not mitigate the dialect prejudice, but can exacerbate the discrepancy between covert and overt stereotypes, by teaching language models to superficially conceal the racism that they maintain on a deeper level. Our findings have far-reaching implications for the fair and safe employment of language technology.
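The decision experiments described above can be pictured as a matched comparison. The same prompt is filled once with an AAE text and once with a meaning-matched SAE text, and the model's preference between two outcome words is recorded for each version. The Python sketch below assumes that setup. Several pieces are illustrative assumptions rather than the paper's materials or results:

- the prompt template;
- the outcome words `convicted` and `acquitted`;
- the example sentence pairs;
- `toy_score`, a stand-in for a real language model that returns made-up probabilities.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical prompt template (an assumption, not the paper's exact wording).
TEMPLATE = 'He is accused of committing a crime. He says: "{text}" He should be'

# Hypothetical meaning-matched (AAE, SAE) pairs using features named in Figure 3
# ("finna" for the immediate future, invariant "be" for habitual aspect).
PAIRS = [
    ("He finna go to the store", "He is about to go to the store"),
    ("She be working late", "She is usually working late"),
]


def toy_score(prompt: str, word: str) -> float:
    """Hypothetical stand-in for a model's probability of `word` after `prompt`.

    The numbers are made up; the artificial boost for "finna" exists only
    so the demo has an effect to detect.
    """
    base = {"convicted": 0.4, "acquitted": 0.6}[word]
    if word == "convicted" and " finna " in prompt:
        base += 0.3
    return base


def decision_rate(
    pairs: Sequence[Tuple[str, str]],
    score: Callable[[str, str], float],
    harmful: str = "convicted",
    benign: str = "acquitted",
) -> Tuple[float, float]:
    """Share of AAE inputs and of SAE inputs for which `harmful` outscores `benign`."""
    aae_hits = sae_hits = 0
    for aae_text, sae_text in pairs:
        aae_prompt = TEMPLATE.format(text=aae_text)
        sae_prompt = TEMPLATE.format(text=sae_text)
        # Booleans add as 0/1, so each comparison counts one "harmful" decision.
        aae_hits += score(aae_prompt, harmful) > score(aae_prompt, benign)
        sae_hits += score(sae_prompt, harmful) > score(sae_prompt, benign)
    return aae_hits / len(pairs), sae_hits / len(pairs)


print(decision_rate(PAIRS, toy_score))  # → (0.5, 0.0)
```

Because each AAE text is paired with an SAE text of the same meaning, the two prompts differ only in dialect. Any gap between the two rates is therefore attributable to how the text is phrased, not to what it says.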

Paper Structure

This paper contains 36 sections, 7 equations, 24 figures, and 28 tables.

Figures (24)

  • Figure 1: Basic functioning of Matched Guise Probing. a: We draw upon texts in AAE (blue) and SAE (green). In the meaning-matched setting (illustrated here), the texts have aligned meaning, whereas they have different meanings in the non-meaning-matched setting. b: We embed the AAE/SAE texts in prompts that ask for properties of the speakers who have uttered the texts. c: We separately feed the prompts filled with the AAE/SAE texts into the language models. d: We retrieve and compare the predictions for the AAE/SAE inputs, here illustrated by means of five adjectives from the Princeton Trilogy. See Methods for more details.
  • Figure 2: Agreement of stereotypes about African Americans in humans and (overt and covert) stereotypes about African Americans in language models. The black dotted line shows chance agreement based on a random bootstrap. Error bars represent the standard error across different language models, model versions, settings, and prompts. While the language models' overt stereotypes agree most strongly with current human stereotypes, which are the most positive experimentally recorded ones, their covert stereotypes agree most strongly with human stereotypes from the 1930s, which are the most negative experimentally recorded ones.
  • Figure 3: Stereotype strength for individual linguistic features of AAE. Error bars represent the standard error across different language models/model versions and prompts. The examined linguistic features are: use of invariant be for habitual aspect; use of finna as a marker of the immediate future; use of (unstressed) been for SAE has been/have been (i.e., present perfects); absence of copula is and are for present tense verbs; use of ain't as a general preverbal negator; orthographic realization of word-final -ing as -in; use of invariant stay for intensified habitual aspect; and inflection absence in the third person singular present tense. The measured stereotype strength is significantly above zero for all examined linguistic features, indicating that they all evoke raciolinguistic stereotypes in language models. At the same time, stereotype strength varies considerably across individual features. See the Supplementary Information for more details and analyses.
  • Figure 4: Association of different occupations with AAE vs. SAE. Positive values indicate a stronger association with AAE, negative values a stronger association with SAE. While the bottom five occupations (i.e., occupations associated most strongly with SAE) mostly require a university degree, this is not the case for the top five occupations (i.e., occupations associated most strongly with AAE).
  • Figure 5: Prestige of occupations that language models associate with AAE (positive values) vs. SAE (negative values). The shaded area shows a 95% confidence band. The association with AAE vs. SAE predicts occupational prestige. Results for individual language models are provided in the Extended Data.
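Figure 1 describes Matched Guise Probing as follows: AAE and SAE versions of a text are fed into the same prompt, and the model's predictions for the two versions are compared. The minimal Python sketch below shows that comparison. It assumes that an adjective's score is the log ratio of its probability under the AAE guise to its probability under the SAE guise, averaged over prompts and text pairs. The function names, the prompt template, the example pair, and the probabilities are hypothetical, and `toy_prob` stands in for a real language model. `ols_slope` illustrates the kind of fit behind Figure 5 (occupational prestige against AAE/SAE association) as a plain least-squares slope; that, too, is an assumption about the analysis.

```python
import math
from statistics import mean
from typing import Callable, List, Sequence, Tuple

Prob = Callable[[str, str], float]

# Hypothetical prompt template with a {text} slot (an assumption).
TEMPLATES = ['A person who says "{text}" tends to be']
# One hypothetical meaning-matched (AAE, SAE) pair using habitual "be".
PAIRS = [("She be working late", "She is usually working late")]


def toy_prob(prompt: str, word: str) -> float:
    """Hypothetical stand-in for p(word | prompt); the numbers are made up."""
    base = {"lazy": 0.01, "brilliant": 0.02}[word]
    if word == "lazy" and "She be " in prompt:
        base *= 2  # artificial effect so the demo has something to detect
    return base


def association(word: str, pairs: Sequence[Tuple[str, str]],
                templates: Sequence[str], prob: Prob) -> float:
    """Mean log ratio of p(word | AAE guise) to p(word | SAE guise).

    Positive: word is more associated with the AAE guise; negative: with SAE.
    """
    ratios = [
        math.log(prob(t.format(text=aae), word) / prob(t.format(text=sae), word))
        for aae, sae in pairs
        for t in templates
    ]
    return mean(ratios)


def top_k(words: Sequence[str], pairs: Sequence[Tuple[str, str]],
          templates: Sequence[str], prob: Prob, k: int = 5) -> List[str]:
    """Words most strongly associated with the AAE guise, strongest first."""
    scores = {w: association(w, pairs, templates, prob) for w in words}
    return sorted(scores, key=scores.get, reverse=True)[:k]


def ols_slope(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Least-squares slope of ys on xs, e.g. prestige regressed on association."""
    mx, my = mean(xs), mean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den


print(top_k(["brilliant", "lazy"], PAIRS, TEMPLATES, toy_prob, k=1))  # → ['lazy']
```

Using a ratio of probabilities, rather than raw probabilities, cancels out how likely an adjective is in general. Only the shift caused by the dialect of the embedded text remains.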