Linguistic Bias in ChatGPT: Language Models Reinforce Dialect Discrimination
Eve Fleisig, Genevieve Smith, Madeline Bossi, Ishita Rustagi, Xavier Yin, Dan Klein
TL;DR
The paper investigates linguistic bias in ChatGPT across ten English dialects. It finds that the standard varieties, Standard American English (SAE) and Standard British English (SBE), are treated as the default, while speakers of minoritized varieties face harms such as stereotyping, demeaning content, and reduced comprehension. The work comprises two studies: one analyzes which linguistic features of the input are retained in default model outputs, and the other collects native-speaker evaluations of the responses, including a condition in which the model is asked to imitate the input dialect. GPT-3.5 retains SAE/SBE features at high rates but retains features of minoritized varieties at far lower rates. Native speakers report harms that intensify when the models imitate dialects; GPT-4 improves on GPT-3.5 in comprehension, warmth, and friendliness but increases stereotyping. The work shows how widely used language models can perpetuate linguistic discrimination and underscores the need for mitigation to improve accessibility and equity for speakers of minoritized dialects. It also outlines limitations related to data sources, sampling, and ethics, and points to broader linguistic evaluation and bias reduction as future work.
Abstract
We present a large-scale study of linguistic bias exhibited by ChatGPT covering ten dialects of English (Standard American English, Standard British English, and eight widely spoken non-"standard" varieties from around the world). We prompted GPT-3.5 Turbo and GPT-4 with text by native speakers of each variety and analyzed the responses via detailed linguistic feature annotation and native speaker evaluation. We find that the models default to "standard" varieties of English; based on evaluation by native speakers, we also find that model responses to non-"standard" varieties consistently exhibit a range of issues: stereotyping (19% worse than for "standard" varieties), demeaning content (25% worse), lack of comprehension (9% worse), and condescending responses (15% worse). We also find that if these models are asked to imitate the writing style of prompts in non-"standard" varieties, they produce text that exhibits lower comprehension of the input and is especially prone to stereotyping. GPT-4 improves on GPT-3.5 in terms of comprehension, warmth, and friendliness, but also exhibits a marked increase in stereotyping (+18%). The results indicate that GPT-3.5 Turbo and GPT-4 can perpetuate linguistic discrimination toward speakers of non-"standard" varieties.
