Linguistic Bias in ChatGPT: Language Models Reinforce Dialect Discrimination

Eve Fleisig, Genevieve Smith, Madeline Bossi, Ishita Rustagi, Xavier Yin, Dan Klein

TL;DR

The paper investigates linguistic bias in ChatGPT across ten varieties of English. It finds that the models treat the standard varieties, Standard American English (SAE) and Standard British English (SBE), as the default, while responses to minoritized varieties show harms such as stereotyping, demeaning content, and reduced comprehension. The work comprises two studies: the first analyzes linguistic feature retention in default model outputs, and the second collects native-speaker evaluations, including of responses in which the model imitates the input variety. GPT-3.5 retains SAE and SBE features at high rates but retains features of minoritized varieties far less often. Native speakers report harms that intensify when the models imitate the input variety; GPT-4 improves on GPT-3.5 in comprehension, warmth, and friendliness but increases stereotyping (+18%). The work highlights how widely used language models can perpetuate linguistic discrimination and underscores the need for mitigation to improve accessibility and equity for speakers of minoritized varieties. It also outlines limitations related to data sources, sampling, and ethics, pointing to avenues for broader linguistic evaluation and bias reduction in future work.

Abstract

We present a large-scale study of linguistic bias exhibited by ChatGPT covering ten dialects of English (Standard American English, Standard British English, and eight widely spoken non-"standard" varieties from around the world). We prompted GPT-3.5 Turbo and GPT-4 with text by native speakers of each variety and analyzed the responses via detailed linguistic feature annotation and native speaker evaluation. We find that the models default to "standard" varieties of English; based on evaluation by native speakers, we also find that model responses to non-"standard" varieties consistently exhibit a range of issues: stereotyping (19% worse than for "standard" varieties), demeaning content (25% worse), lack of comprehension (9% worse), and condescending responses (15% worse). We also find that if these models are asked to imitate the writing style of prompts in non-"standard" varieties, they produce text that exhibits lower comprehension of the input and is especially prone to stereotyping. GPT-4 improves on GPT-3.5 in terms of comprehension, warmth, and friendliness, but also exhibits a marked increase in stereotyping (+18%). The results indicate that GPT-3.5 Turbo and GPT-4 can perpetuate linguistic discrimination toward speakers of non-"standard" varieties.

Paper Structure (38 sections, 9 figures, 6 tables)

Figures (9)

  • Figure 1: Sample model responses (top) and native speaker reactions to model responses (bottom).
  • Figure 2: Estimated maximum speaker population vs. retention rate for minoritized varieties.
  • Figure 3: Change in % of examples using British, American, or either orthographic style from inputs to outputs.
  • Figure 4: Average response ratings by variety (5-point scale). Red titles indicate negative qualities, green titles indicate positive qualities, and yellow titles indicate neutral ones. Gray horizontal lines are 95% confidence intervals. The orange dotted line is the average for the standard varieties (SAE and SBE), shown for ease of comparison. Responses to minoritized varieties (blue) were rated worse in terms of stereotyping (19% gap), demeaning content (25%), comprehension (9%), naturalness (8%), and condescension (15%).
  • Figure 5: Top: Change in average ratings for each variety from GPT-3.5 responses that do not imitate the input variety to GPT-3.5 responses that do. Bottom: Change in ratings from GPT-3.5 responses that imitate the input variety to GPT-4 responses that imitate the input variety.
  • ...and 4 more figures