Analyzing Dialectical Biases in LLMs for Knowledge and Reasoning Benchmarks

Eileen Pan, Anna Seo Gyeong Choi, Maartje ter Hoeve, Skyler Seto, Allison Koenecke

TL;DR

The paper investigates dialectal biases in LLMs by perturbing standard SAE QA prompts into six real-world dialects using Multi-VALUE and evaluating three models on BoolQ, SciQ, and MMLU. It demonstrates pervasive performance degradation for non-SAE dialects, with Singaporean English and AAVE showing the largest drops, and links much of this degradation to a small set of high-impact grammatical rules. By decomposing dialectal effects down to individual grammar rules, the study reveals that rules such as existential "it", zero copula (dropping the copula), and "y'all" drive substantial accuracy losses and exhibit interaction effects in some dialects. These findings highlight specific targets for bias mitigation and suggest data-augmentation or targeted training strategies to improve multi-dialect robustness in knowledge and reasoning benchmarks.

Abstract

Large language models (LLMs) are ubiquitous in modern-day natural language processing. However, previous work has shown degraded LLM performance for under-represented English dialects. We analyze the effects of typifying "standard" American English language questions as non-"standard" dialectal variants on multiple choice question answering tasks and find up to a 20% reduction in accuracy. Additionally, we investigate the grammatical basis of under-performance in non-"standard" English questions. We find that individual grammatical rules have varied effects on performance, but some are more consequential than others: three specific grammar rules (existential "it", zero copula, and y'all) can explain the majority of performance degradation observed in multiple dialects. We call for future work to investigate bias mitigation methods focused on individual, high-impact grammatical structures.

Paper Structure

This paper contains 12 sections, 3 figures, 11 tables.

Figures (3)

  • Figure 1: Grammatical rules are applied to full QA datasets one at a time; the top five rules by accuracy reduction are shown for each facet (on the subset of questions for which the grammatical rule can be applied, and for which the LLM answered correctly when asked in SAE). Accuracy difference refers to the comparison between the original QA dataset and the QA dataset having applied only a single grammatical rule. Grammatical rule definitions are provided in Table \ref{tab:grammar_table}.
  • Figure 2: Breakdown of the extent to which accuracy decreases can be attributed to subsets of dialectal grammatical rules. In each group of bars, we consider the same data subset: the n samples that LLMs answered correctly in SAE and to which the grammar rule is applicable. Within the respective bars, we denote the percentage of overall dialectal performance degradation (All Dialect Rules) recovered by just one rule obligatory for that dialect (Rule of Interest) and by all obligatory rules (Obligatory Dialect Rules). Abbreviations are: African American English (UAAVE), Singaporean English (CollSgE), Appalachian English (AppE), and Southern English (SEAmE).
  • Figure 3: Breakdown of the extent to which overall accuracy decreases can be attributed to obligatory grammatical rules compared to all dialect rules. Abbreviations are: African American English (UAAVE), Singaporean English (CollSgE), Indian English (IndE), Appalachian English (AppE), Chicano English (ChcE), and Southern English (SEAmE).
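
The attribution described in the figure captions can be illustrated with a minimal sketch. It is not the authors' code: the `Item` fields, the function names, and the choice to define each "share" as the ratio of one accuracy drop to the drop under all dialect rules are assumptions made here for illustration. The sketch restricts to questions where the rule of interest applies and the model was correct in SAE (so SAE accuracy on that subset is 1.0 by construction), then compares the accuracy drop under one rule, under all obligatory rules, and under all dialect rules.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass
class Item:
    """One QA question with hypothetical per-condition correctness flags."""
    rule_applicable: bool     # the rule of interest can be applied to this question
    correct_sae: bool         # answered correctly in the original SAE form
    correct_one_rule: bool    # correct after applying only the rule of interest
    correct_obligatory: bool  # correct after applying all obligatory dialect rules
    correct_all_rules: bool   # correct after applying all dialect rules


def error_rate(flags: Iterable[bool]) -> float:
    """Fraction of False flags; equals the accuracy drop from a baseline of 1.0."""
    flags = list(flags)
    return sum(1 for f in flags if not f) / len(flags)


def attribution(items: List[Item]) -> Dict[str, float]:
    # Keep only questions where the rule applies and the SAE answer was correct.
    subset = [it for it in items if it.rule_applicable and it.correct_sae]
    if not subset:
        raise ValueError("no applicable questions answered correctly in SAE")

    drop_one = error_rate(it.correct_one_rule for it in subset)
    drop_obligatory = error_rate(it.correct_obligatory for it in subset)
    drop_all = error_rate(it.correct_all_rules for it in subset)

    def share(drop: float) -> float:
        # Assumed definition: fraction of the all-rules drop that this drop accounts for.
        return drop / drop_all if drop_all > 0 else float("nan")

    return {
        "n": len(subset),
        "drop_one_rule": drop_one,
        "drop_obligatory": drop_obligatory,
        "drop_all_rules": drop_all,
        "share_one_rule": share(drop_one),
        "share_obligatory": share(drop_obligatory),
    }
```

Calling `attribution(items)` on a list of `Item` records returns the subset size `n` together with the three drops and two shares. Because interaction effects between rules are possible, a share computed this way need not lie between 0 and 1, and the shares of individual rules need not sum to the obligatory-rule share.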