Analyzing Dialectical Biases in LLMs for Knowledge and Reasoning Benchmarks
Eileen Pan, Anna Seo Gyeong Choi, Maartje ter Hoeve, Skyler Seto, Allison Koenecke
TL;DR
The paper investigates dialectal biases in LLMs by perturbing Standard American English (SAE) QA prompts into six real-world dialects using Multi-VALUE and evaluating three models on BoolQ, SciQ, and MMLU. It demonstrates pervasive performance degradation for non-SAE dialects, with Singaporean English and African American Vernacular English (AAVE) showing the largest drops, and links much of this degradation to a small set of high-impact grammatical rules. By decomposing dialectal effects down to individual grammar rules, the study reveals that rules such as existential "it", zero copula, and "y'all" drive substantial accuracy losses and exhibit interaction effects in some dialects. These findings highlight specific targets for bias mitigation and suggest data-augmentation or targeted training strategies to improve multi-dialect robustness on knowledge and reasoning benchmarks.
Abstract
Large language models (LLMs) are ubiquitous in modern-day natural language processing. However, previous work has shown degraded LLM performance for under-represented English dialects. We analyze the effects of typifying "standard" American English language questions as non-"standard" dialectal variants on multiple-choice question-answering tasks and find up to a 20% reduction in accuracy. Additionally, we investigate the grammatical basis of under-performance in non-"standard" English questions. We find that individual grammatical rules have varied effects on performance, but some are more consequential than others: three specific grammar rules (existential "it", zero copula, and "y'all") can explain the majority of performance degradation observed in multiple dialects. We call for future work to investigate bias mitigation methods focused on individual, high-impact grammatical structures.
